Title: Essential Bioinformatics and Biocomputing LSM2104: Section I Lecture 3: More biological databases, r
1Essential Bioinformatics and Biocomputing (LSM2104: Section I)
Lecture 3: More biological databases, retrieval systems and database searching
2Biological databases: Function and pathways databases - KEGG
- The KEGG database (http://www.genome.ad.jp/kegg/kegg2.html) links genetic information with cellular functions. It provides keyword and pre-calculated sequence-comparison searches.
- It consists of several interconnected databases:
  - PATHWAY contains information on metabolic and regulatory networks.
  - GENES contains information on genes and proteins.
  - LIGAND contains information on chemical compounds and reactions involved in cellular processes.
  - EXPRESSION and BRITE contain micro-array gene expression data.
  - SSDB helps identify protein coding genes.
- It has an integrated database retrieval system, DBGET.
3KEGG web-page
4KEGG Human TCA Pathway
5Biological databases: BIND - Biomolecular Interaction Network Database (http://www.bind.ca/)
- Stores descriptions of interactions, molecular complexes and pathways.
- Provides search tools.
- PreBIND locates literature sources:
  - Show me a list of all of the papers in PubMed that are about my protein of interest. Then classify all of these papers and tell me which ones are likely to contain interaction information.
  - Finally, identify all of the other proteins mentioned in these papers and indicate whether these proteins might interact with my protein of interest.
- Bader GD, et al. BIND: The Biomolecular Interaction Network Database. Nucleic Acids Res. 2001; 29(1):242-5.
6Biological databases: BIND - Biomolecular Interaction Network Database (http://www.bind.ca/)
- Its BLAST tool searches the BIND database for similarity to a query sequence.
- BIND is at the forefront of the proteomics efforts and is expected to grow from the large-scale proteomic data.
- Bader GD, et al. BIND: The Biomolecular Interaction Network Database. Nucleic Acids Res. 2001; 29(1):242-5.
7BIND website
8BIND Statistics
Database                         Record Count
Interaction Database             11255
Biomolecular Pathway Database    8
Molecular Complex Database       851
Organisms represented            12
GI Database                      4651
DI Database                      0
Publication Database             428
9Protein family/domain databases: Sequence alignment - Pfam (http://www.sanger.ac.uk/Software/Pfam/)
- Pfam is a collection of multiple protein sequence alignments and statistical models that can be used to classify protein families and domains.
- Descriptions of protein domains
- Given an established SWISS-PROT sequence, Pfam shows the pre-computed domain structure of the protein.
- Given a completely new protein sequence, Pfam computes a domain structure.
10Pfam Web-site
11Protein family/domain databases: Sequence patterns - PROSITE (http://ca.expasy.org/prosite/)
- A database of protein families and domains. It consists of biologically significant sites, patterns and profiles that help to reliably identify to which known protein family (if any) a new sequence belongs.
- It currently contains patterns and profiles specific for more than a thousand protein families or domains.
- An example of a pattern (motif), where square brackets list the residues allowed at a position, x is any residue, and numbers in parentheses give a repeat count or range:
  W-x(9,11)-[VFY]-[FYW]-x(6,7)-[GSTNE]-[GSTQCR]-[FYW]-x(2)-P
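One way to see what such a pattern means is to translate it into a regular expression. The sketch below is a minimal illustration, not PROSITE's own scanner. It assumes the standard PROSITE notation in which alternative residues are written in square brackets (e.g. [VFY]), handles only the basic elements (single residues, x, [..], {..} and repeat counts), and ignores anchors. The function name `prosite_to_regex` and the toy sequence are hypothetical.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a basic PROSITE pattern into a Python regular expression.

    Supported elements (an assumption of this sketch): single residues,
    x (any residue), [ABC] (one of), {ABC} (none of), and repeat counts
    such as x(2) or x(9,11). Anchors (< and >) are not handled.
    """
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        # Split the element into its core and an optional (n) or (n,m) suffix.
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        if m is None:
            raise ValueError(f"cannot parse pattern element: {element!r}")
        core, low, high = m.group(1), m.group(2), m.group(3)
        if core == "x":
            core = "."                       # any residue
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"   # any residue except those listed
        # [ABC] and single letters are already valid regex syntax.
        repeat = ""
        if low is not None:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(core + repeat)
    return "".join(parts)


PATTERN = "W-x(9,11)-[VFY]-[FYW]-x(6,7)-[GSTNE]-[GSTQCR]-[FYW]-x(2)-P"
regex = prosite_to_regex(PATTERN)

# Made-up toy sequence built to contain one occurrence of the motif.
toy_sequence = "MK" + "W" + "A" * 9 + "VF" + "A" * 6 + "GSF" + "AA" + "P" + "LL"
hit = re.search(regex, toy_sequence)
print(regex)
print(hit.span() if hit else "no match")
```

A regular expression only reports whether the pattern is present. It says nothing about whether an occurrence is biologically meaningful, which is why short patterns need the same caution as any other database match.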
12Protein sequence motif databases - PROSITE
- A profile is a matrix derived from multiple
alignments
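To make "a matrix derived from multiple alignments" concrete, the sketch below builds the simplest such matrix: the frequency of each residue in each alignment column. This is only an illustration of the idea. Real PROSITE profiles use weighted scores and gap penalties, which are not modelled here, and the toy alignment and function names are made up.

```python
from collections import Counter


def frequency_profile(alignment):
    """Return one {residue: frequency} dict per column of the alignment."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for column in zip(*alignment):          # iterate column by column
        counts = Counter(column)
        total = len(column)
        profile.append({res: n / total for res, n in counts.items()})
    return profile


def profile_score(profile, segment):
    """Sum the column frequencies of the residues in `segment`."""
    return sum(col.get(res, 0.0) for col, res in zip(profile, segment))


# Hypothetical four-sequence, four-column alignment.
toy_alignment = ["ACDW", "ACEW", "GCDW", "ACDY"]
profile = frequency_profile(toy_alignment)

for position, column in enumerate(profile, start=1):
    print(position, column)
print(profile_score(profile, "ACDW"), profile_score(profile, "GCEY"))
```

Unlike a pattern, which either matches or does not, a profile gives every candidate segment a graded score, so sequences that resemble the family only partially can still be ranked.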
13PROSITE web-site
14PROSITE web-site
15Biological data retrieval systems: Entrez (http://www.ncbi.nlm.nih.gov/Database/index.html)
- 1. A retrieval system for searching a number of inter-connected databases at the NCBI. It provides access to:
  - PubMed: the biomedical literature (Medline)
  - GenBank: nucleotide sequence database
  - Protein: sequence database
  - Structure: three-dimensional macromolecular structures
  - Genome: complete genome assemblies
  - PopSet: population study data sets
  - OMIM: Online Mendelian Inheritance in Man
  - Taxonomy: organisms in GenBank
  - Books: online books
  - ProbeSet: gene expression and microarray datasets
  - 3D Domains: domains from Entrez Structure
  - UniSTS: markers and mapping data
  - SNP: single nucleotide polymorphisms
  - CDD: conserved domains
- 2. Entrez allows users to perform various searches.
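Entrez searches can also be scripted. The sketch below only builds a query URL and makes no network request. It assumes NCBI's E-utilities `esearch` endpoint, which is not mentioned in the lecture itself, and the search term and the helper name `esearch_url` are illustrative choices rather than a recommended query.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities (an assumption of this sketch;
# the lecture only describes the Entrez web interface).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term, retmax=20):
    """Build an ESearch URL for one Entrez database and one query term."""
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return f"{EUTILS_BASE}/esearch.fcgi?{query}"


# Illustrative query: protein records for human IL-10, limited by field tags.
url = esearch_url("protein", "IL10[Gene Name] AND Homo sapiens[Organism]")
print(url)
```

The same term can be sent to a different database by changing `db`, which mirrors how the Entrez interface lets one query be repeated across its linked databases.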
16Entrez Interface
17Biological data retrieval systems: SRS (http://srs.ebi.ac.uk/)
- SRS is a retrieval system for searching several linked databases at the EBI. Like Entrez, it provides access to various databases and supports keyword, sequence similarity and class searches.
18SRS interface
19Biological databases: Database searching
- Database searching can be used to answer questions such as:
  - What is the sequence of human IL-10?
  - What is the gene coding for human IL-10?
  - Is the function of human IL-10 known? What is it?
  - Are there any variants of human IL-10?
  - Who sequenced this gene?
  - What are the differences between IL-10 in human and in other species?
  - Which species are known to have IL-10?
  - Is the structure of IL-10 known?
  - What are the structural and functional domains of IL-10?
  - Are there any motifs in the sequence that explain its properties?
  - What is the upstream region of IL-10 that contains transcriptional regulation sites?
20Biological databases: Database searching
- For a well-studied molecule such as IL-10, we expect to extract many of the well-known facts.
- These searches are useful for characterizing newly identified sequences.
- Notes
  - Multiple errors can be found in database entries. Some are introduced when sequences are submitted to databases, some are due to naming conventions (or the lack of them), and some are due to poor links between databases.
  - Users should treat data extracted from databases with care and compare the results with information from other databases, journal articles, and other sources.
21Biological databases: Keyword searching
- Search DNA and protein databases with keywords (10-July-2002)
22Biological databases: Keyword searching
- Notes
  - GenPept is the protein translation of GenBank. SPTR is SWISS-PROT plus the protein translation of EMBL sequences.
  - Different databases contain different, but overlapping, sets of entries. The same sequence may have entries in different databases.
  - Some databases have non-redundant sections, for example the UniGene system, which automatically partitions GenBank sequences into a non-redundant set of gene-oriented clusters.
  - For completeness of results, it is usually necessary to search multiple databases.
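The notes above suggest a simple routine when combining keyword hits: pool the results from every database, then group the entries that carry the same sequence. The sketch below shows this with made-up data. The database names, accession strings and sequences are all hypothetical placeholders, and exact string equality stands in for a real redundancy check.

```python
from collections import defaultdict


def merge_hits(hits_by_database):
    """Group (accession, sequence) hits from several databases by sequence.

    Returns {sequence: [(database, accession), ...]} so that entries
    describing the same sequence in different databases end up together.
    """
    merged = defaultdict(list)
    for database, hits in hits_by_database.items():
        for accession, sequence in hits:
            # Normalise case so that trivially different copies still match.
            merged[sequence.upper()].append((database, accession))
    return dict(merged)


# Hypothetical search results from three databases.
toy_hits = {
    "db_A": [("A001", "MKTLLV"), ("A002", "MKSAAV")],
    "db_B": [("B17", "mktllv")],
    "db_C": [("C5", "MGGHHW"), ("C9", "MKTLLV")],
}

merged = merge_hits(toy_hits)
for sequence, entries in merged.items():
    print(sequence, entries)
```

Here five entries collapse into three distinct sequences, and no single database contains all three, which is the overlap-without-completeness situation the notes describe.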
23Biological databases: Database coverage
- Example: Scorpion KALIOTOXIN 2 (SwissProt: P45696)
24Biological databases: Database errors
- Our scorpion study (Srinivasan et al., 2002) also revealed numerous errors and missing data in the major databases. For one of the entries, the sequence was wrong in the journal publication but correct in the databases.
25Biological databases: Sequence similarity searching
- Proteins that have similar sequences often have similar structures and similar functions. If we have only a protein sequence, we can infer its structural and functional properties by analyzing sequences that are similar.
- Sequence-structure-function relationship
  - Similar sequence -> similar structure -> similar function
- Why this relationship?
  - Evolution involves sequence variation.
  - The laws of physics and chemistry define the sequence-structure relationship.
  - Function is defined by molecular interactions, which are structure-based.
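A basic way to put a number on "similar sequence" is percent identity over an existing alignment. The sketch below assumes the two sequences are already aligned to equal length with "-" as the gap character, and it counts only columns where neither sequence has a gap. Other definitions of identity use a different denominator, so this is one convention among several, and the example sequences are made up.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of identical residues over the ungapped alignment columns."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue                 # skip columns that contain a gap
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# Hypothetical pair of aligned peptide fragments.
print(round(percent_identity("MKT-LLV", "MKSALLV"), 1))
```

Identity alone does not establish that two proteins share structure or function. It is a starting point that the statistical measures discussed later in the lecture help to interpret.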
26Database Searching: Cautionary Notes
- Some database matches happen because of chance similarities, in keyword and sequence similarity searches alike. Distinguishing chance matches from biologically significant matches is one of the most important issues for effective use of biological databases.
- As a test, GenBank was searched with the sequence similarity tool BLAST (set for short, nearly exact matches), using the names of last semester's lecturers of this module as query sequences. The search returned two imperfect matches to VLADIMIR and seven perfect matches to TINWEE.
27Database Searching: Cautionary Notes
28Database Searching: Cautionary Notes
- If we blindly interpreted these results, we would erroneously conclude that the motif VLADIMIR may have some importance for the structure or function of the strawberry vein banding virus, and that TINWEE has to do with calcium-dependent protein kinase in rice and possibly in Legionella pneumophila.
- We can avoid conclusions like this by looking at the similarity scores. This will be covered in more detail later in the course. For now it is important to know that the lower the expected value (E-value), the better the match. Anything close to or greater than 1 should be regarded with suspicion. However, matches that are not statistically significant can sometimes still have biological significance. If we suspect that this might be the case, further analysis is necessary.
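A rough back-of-envelope calculation shows why short exact matches turn up by chance. The sketch below estimates how many times a given word is expected to occur in a random protein database, assuming all 20 amino acids are equally likely and independent. This is not the statistic BLAST reports, and the database size used is a made-up round number, not the actual size of GenBank.

```python
def expected_exact_matches(word_length, database_residues, alphabet_size=20):
    """Expected number of exact chance occurrences of one word.

    Assumes uniform, independent residues: each of the possible start
    positions matches with probability (1 / alphabet_size) ** word_length.
    """
    positions = max(database_residues - word_length + 1, 0)
    return positions * (1.0 / alphabet_size) ** word_length


HYPOTHETICAL_DB_SIZE = 10**9   # residues; illustrative figure only

for word in ("TINWEE", "VLADIMIR"):
    expected = expected_exact_matches(len(word), HYPOTHETICAL_DB_SIZE)
    print(f"{word}: about {expected:.3g} exact matches expected by chance")
```

Under these assumptions a six-letter word is expected roughly 16 times by chance, while an eight-letter word is expected far less than once. That fits the pattern in the lecturers' experiment, where the shorter name gave perfect matches and the longer one only imperfect ones.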
29Database Searching: Cautionary Notes
- Examples of chance matches: virtually any string or keyword can show matches to database entries. We are interested only in the real ones.
- The same search with GenBank. Fortunately, we have statistical measures that indicate the quality of matches. However, matches that have low statistical significance sometimes have real biological significance. More about this will be taught later in the course.
30Biological databases: Concluding remarks
- 1. Biological databases represent an invaluable resource in support of biological research.
- 2. We can learn much about a particular molecule by searching databases and using available analysis tools.
- 3. A large number of databases are available for that task. Some databases are very general, some are more specialized, while some are very specialized. For best results we often need to access multiple databases.
31Biological databases: Concluding remarks
- 4. The major types of databases covered in this course are general nucleotide, general protein, structure, pathway, molecular interaction, protein motif, publication, and specialized databases.
- 5. Common database search methods include keyword matching, sequence similarity, motif searching, and class searching.
- 6. The problems with using biological databases include incomplete information, data spread over multiple databases, redundant information, various errors, sometimes incorrect links, and constant change.
32Biological databases: Concluding remarks
- 7. Database standards, nomenclature, and naming conventions are not clearly defined for many aspects of biological information. This makes information extraction more difficult.
- 8. Retrieval systems help extract rich information from multiple databases. Examples include Entrez and SRS.
- 9. Formulating queries is a serious issue in biological databases. Often the quality of results depends on the quality of the queries.
33Biological databases: Concluding remarks
- 10. Statistical measures indicate the quality of matches. Often the statistical and biological significance are related. Sometimes, however, matches of real biological significance have low statistical scores.
- 11. Access to biological databases is so important that today virtually every molecular biology project starts and ends with querying biological databases.
34Biological databases: Summary of today's lecture
- Popular databases: KEGG, BIND, Pfam, PROSITE, PubMed
- Data retrieval systems: Entrez, SRS
- Database searching: capability, potential problems
- Statistics
  - Protein families (> 5K)
  - Sequence patterns (> 1.5K)
  - Interactions (> 11K, or about 110 x 110, which is relatively few)
  - Relatively small amount of data for function (e.g. pathways < 200)