Essential Bioinformatics and Biocomputing LSM2104: Section I Lecture 3: More biological databases, r - PowerPoint PPT Presentation

Provided by: dbs7
1
Essential Bioinformatics and Biocomputing
(LSM2104: Section I) Lecture 3: More
biological databases, retrieval systems and
database searching
2
Biological databases: Function and pathways
databases - KEGG
  • The KEGG database
    (http://www.genome.ad.jp/kegg/kegg2.html) links
    genetic information with cellular functions. It
    provides keyword and pre-calculated sequence
    comparison searches.
  • It consists of several interconnected databases:
  • PATHWAY contains information on metabolic and
    regulatory networks.
  • GENES contains information on genes and proteins.
  • LIGAND contains information on chemical compounds
    and reactions involved in cellular processes.
  • EXPRESSION and BRITE contain micro-array gene
    expression data.
  • SSDB helps identify protein coding genes.
  • It has an integrated database retrieval system,
    DBGET.

3
KEGG web-page
4
KEGG Human TCA Pathway
5
Biological databases: BIND, the Biomolecular
Interaction Network Database (http://www.bind.ca/)
  • Stores descriptions of interactions, molecular
    complexes and pathways.
  • Provides search tools.
  • PreBIND locates literature sources:
    "Show me a list of all of the papers in PubMed
    that are about my protein of interest. Then
    classify all of these papers and tell me which
    ones are likely to contain interaction
    information. Finally, identify all of the other
    proteins mentioned in these papers and indicate
    whether these proteins might interact with my
    protein of interest."
  • Bader GD, et al. BIND: The Biomolecular
    Interaction Network Database.
    Nucleic Acids Res. 2001; 29(1): 242-5.

6
Biological databases: BIND, the Biomolecular
Interaction Network Database (http://www.bind.ca/)
  • 3. Its BLAST tool searches the BIND database for
    similarity to a query sequence.
  • BIND is at the forefront of the proteomics
    efforts and is expected to grow as large-scale
    proteomic data become available.
  • Bader GD, et al. BIND: The Biomolecular
    Interaction Network Database.
    Nucleic Acids Res. 2001; 29(1): 242-5.

7
BIND website
8
BIND Statistics
    Database                        Record Count
    Interaction Database            11255
    Biomolecular Pathway Database   8
    Molecular Complex Database      851
    Organisms represented           12
    GI Database                     4651
    DI Database                     0
    Publication Database            428

9
Protein family/domain databases: Sequence
alignment - Pfam
(http://www.sanger.ac.uk/Software/Pfam/)
  • Pfam is a collection of multiple protein sequence
    alignments and statistical models that can be
    used to classify protein families and domains.
  • Descriptions of protein domains
  • Given an established SWISS-PROT sequence, Pfam
    shows the pre-computed domain structure of the
    protein.
  • Given a completely new protein sequence, Pfam
    computes a domain structure.

10
Pfam Web-site
11
Protein family/domain databases: Sequence
patterns - PROSITE (http://ca.expasy.org/prosite/)
  • Protein families and domains. It consists of
    biologically significant sites, patterns and
    profiles that help to reliably identify to which
    known protein family (if any) a new sequence
    belongs.
  • It currently contains patterns and profiles
    specific for more than a thousand protein
    families or domains.
  • An example of a pattern (motif):
  • W-x(9,11)-[VFY]-[FYW]-x(6,7)-[GSTNE]-[GSTQCR]-
    [FYW]-x(2)-P
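
A pattern of this kind can be read as a regular expression. The following is a minimal sketch, not PROSITE's own scanning software: it assumes the usual PROSITE conventions (x for any residue, square brackets for a set of allowed residues, curly braces for excluded residues, and a count or range in parentheses). The function name `prosite_to_regex` and the test sequence are made up for illustration.

```python
import re

def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        # Separate the residue part from an optional repeat such as (2) or (9,11).
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = m.group(1), m.group(2), m.group(3)
        if core == "x":
            token = "."                        # any residue
        elif core.startswith("[") and core.endswith("]"):
            token = core                       # one of the listed residues
        elif core.startswith("{") and core.endswith("}"):
            token = "[^" + core[1:-1] + "]"    # any residue except those listed
        else:
            token = re.escape(core)            # a single fixed residue
        if low and high:
            token += "{%s,%s}" % (low, high)
        elif low:
            token += "{%s}" % low
        parts.append(token)
    return "".join(parts)

PATTERN = "W-x(9,11)-[VFY]-[FYW]-x(6,7)-[GSTNE]-[GSTQCR]-[FYW]-x(2)-P"
regex = prosite_to_regex(PATTERN)

# Synthetic sequence constructed to contain the motif; it is not a real protein.
sequence = "MK" + "W" + "A" * 10 + "VF" + "A" * 6 + "GSY" + "AA" + "P" + "KK"
match = re.search(regex, sequence)
print(regex)
print(match.start(), match.end())
```

A match found this way says only that the sequence fits the pattern; whether it is biologically meaningful is a separate question.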

12
Protein sequence motif databases - PROSITE
  • A profile is a matrix derived from multiple
    alignments
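
To make the idea of a matrix derived from an alignment concrete, here is a simplified sketch that counts residue frequencies per alignment column. It is an illustration only: real PROSITE profiles use weighted scores and gap penalties, which are not modelled here, and the toy alignment is invented.

```python
from collections import Counter

def frequency_profile(alignment):
    """Return one dict per alignment column mapping residue -> relative frequency."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for column in zip(*alignment):
        counts = Counter(column)
        profile.append({res: n / len(column) for res, n in counts.items()})
    return profile

# Invented four-sequence alignment, for illustration only.
toy_alignment = ["WAVF", "WGVY", "WAIF", "WAVF"]
profile = frequency_profile(toy_alignment)
for position, column in enumerate(profile, start=1):
    print(position, column)
```

Each row of the output describes one alignment position; a fully conserved position has a single residue with frequency 1.0.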

13
PROSITE web-site
14
PROSITE web-site
15
Biological data retrieval systems: Entrez
(http://www.ncbi.nlm.nih.gov/Database/index.html)
  • 1. A retrieval system for searching a number of
    inter-connected databases at the NCBI. It
    provides access to:
  • PubMed: the biomedical literature (MEDLINE)
  • GenBank: nucleotide sequence database
  • Protein: sequence database
  • Structure: three-dimensional macromolecular
    structures
  • Genome: complete genome assemblies
  • PopSet: population study data sets
  • OMIM: Online Mendelian Inheritance in Man
  • Taxonomy: organisms in GenBank
  • Books: online books
  • ProbeSet: gene expression and microarray datasets
  • 3D Domains: domains from Entrez Structure
  • UniSTS: markers and mapping data
  • SNP: single nucleotide polymorphisms
  • CDD: conserved domains
  • 2. Entrez allows users to perform various
    searches.
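
Entrez queries can also be issued programmatically. The slides do not cover this; the sketch below assumes NCBI's E-utilities `esearch` endpoint and the Entrez field tags `[Gene Name]` and `[Organism]`, and it only builds the request URL without sending anything. The helper name `build_esearch_url` is hypothetical.

```python
from urllib.parse import urlencode

# Base address of the E-utilities esearch service (an assumption, not from the slides).
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_esearch_url(db, term, retmax=20):
    """Build an esearch URL for one Entrez database and one query term."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return EUTILS_ESEARCH + "?" + urlencode(params)

# Example query: protein records for human IL-10, restricted by gene name and organism.
url = build_esearch_url("protein", "IL10[Gene Name] AND Homo sapiens[Organism]")
print(url)
```

Changing the `db` argument (for example to `pubmed` or `nucleotide`) directs the same query at a different Entrez database.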

16
Entrez Interface
17
Biological data retrieval systems: SRS
(http://srs.ebi.ac.uk/)
  • SRS is a retrieval system for searching several
    linked databases at the EBI. Like Entrez, it
    provides access to various databases and
    supports keyword, sequence similarity and class
    searches.

18
SRS interface
19
Biological databases: Database searching
  • Database searching can be used to answer
    questions such as:
  • What is the sequence of human IL-10?
  • What is the gene coding for human IL-10?
  • Is the function of human IL-10 known? What is it?
  • Are there any variants of human IL-10?
  • Who sequenced this gene?
  • What are the differences between IL-10 in human
    and in other species?
  • Which species are known to have IL-10?
  • Is the structure of IL-10 known?
  • What are the structural and functional domains of
    IL-10?
  • Are there any motifs in the sequence that explain
    its properties?
  • What is the upstream region of IL-10 that contains
    transcriptional regulation sites?

20
Biological databases: Database searching
  • For a well-studied molecule such as IL-10, we
    expect to extract many of the well-known facts.
  • These searches are also useful for characterizing
    newly identified sequences.
  • Notes:
  • Database entries can contain multiple errors.
    Some errors are introduced when sequences are
    submitted to databases. Some are due to naming
    conventions (or the lack of them). Some are due
    to poor links between databases.
  • Users should treat data extracted from databases
    with care and compare the results with
    information from other databases, journal
    articles, and other sources.

21
Biological databases: Keyword searching
  • Search DNA and protein databases with keywords
    (10-July-2002)

22
Biological databases: Keyword searching
  • Notes:
  • GenPept is the protein translation of GenBank.
    SPTR is SWISS-PROT plus the protein translation
    of EMBL sequences.
  • Different databases contain different, but
    overlapping, sets of entries. The same sequence
    may have entries in different databases.
  • Some databases have non-redundant sections. For
    example, the UniGene system automatically
    partitions GenBank sequences into a
    non-redundant set of gene-oriented clusters.
  • For complete results it is usually necessary to
    search multiple databases.
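
When hits from several databases are combined, the same sequence often appears more than once. The following is a minimal sketch of collapsing such duplicates by exact sequence identity. It is not how UniGene builds its clusters, and the records, accessions and sequences below are made up.

```python
from collections import defaultdict

def merge_by_sequence(records):
    """Group (database, accession, sequence) records that share an identical sequence."""
    groups = defaultdict(list)
    for database, accession, sequence in records:
        # Normalise case and spacing so formatting differences do not split a group.
        key = sequence.replace(" ", "").upper()
        groups[key].append((database, accession))
    return dict(groups)

# Invented records for illustration only; the accessions are not real identifiers.
hits = [
    ("SWISS-PROT", "HYPO_0001", "MKTLLV"),
    ("GenPept", "HYPO_A1", "mktllv"),
    ("GenPept", "HYPO_B2", "MKTLIV"),
]
merged = merge_by_sequence(hits)
for sequence, sources in merged.items():
    print(sequence, sources)
```

Exact matching is the strictest possible criterion; entries that differ by a single residue stay separate and still need to be compared by hand or by alignment.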

23
Biological databases: Database coverage
  • Example: Scorpion KALIOTOXIN 2 (SWISS-PROT
    P45696)

24
Biological databases: Database errors
  • Our scorpion study (Srinivasan et al., 2002) also
    revealed numerous errors and missing data in the
    major databases. For one of the entries, the
    sequence printed in the journal publication
    contained an error, while the sequence in the
    databases was correct.

25
Biological Databases: Sequence Similarity
Searching
  • Proteins that have similar sequences often have
    similar structures and similar functions. If we
    have only a protein sequence, we can deduce its
    structural and functional properties by
    analyzing sequences that are similar to it.
  • Sequence - Structure - Function relationship:
  • Similar sequence -> Similar structure -> Similar
    function
  • Why this relationship?
  • Evolution involves sequence variation.
  • The laws of physics and chemistry define the
    sequence-structure relationship.
  • Function, as defined by molecular interaction,
    is structure-based.
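
One simple way to quantify how similar two sequences are is percent identity over an alignment. This is a minimal sketch that assumes the two sequences have already been aligned (gaps written as "-"); it does not perform the alignment itself, and the example sequences are invented.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of identical residues over the aligned, gap-free positions."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # skip positions where either sequence has a gap
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0

# Invented pre-aligned sequences: 4 identical residues out of 5 compared positions.
identity = percent_identity("MKT-LV", "MRTALV")
print(identity)
```

Percent identity is only a rough guide; how much identity is needed before structure or function can be inferred depends on sequence length and on the protein family.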

26
Database Searching: Cautionary Notes
  • Some database matches, whether from keyword or
    sequence similarity searches, arise from chance
    similarity. Distinguishing chance matches from
    biologically significant matches is one of the
    most important issues for effective use of
    biological databases.
  • As a test, we searched GenBank with the sequence
    similarity tool BLAST (set for short, nearly
    exact matches), using the names of last
    semester's lecturers of this module as query
    sequences. The search returned two imperfect
    matches to VLADIMIR and seven perfect matches to
    TINWEE.

27
Database Searching: Cautionary Notes
28
Database Searching: Cautionary Notes
  • If we blindly interpreted these results, we would
    erroneously conclude that the motif VLADIMIR may
    have some importance for the structure or
    function of the strawberry vein banding virus,
    and that TINWEE has to do with calcium-dependent
    protein kinase in rice and possibly in Legionella
    pneumophila.
  • We can avoid conclusions like this by looking at
    the similarity scores. This will be covered in
    more detail later in the course; for now it is
    important to know that the lower the expected
    value (E-value), the better the match. Anything
    close to or greater than 1 should be viewed with
    suspicion. However, matches that are not
    statistically significant can sometimes still
    have biological significance. If we suspect that
    this might be the case, further analysis is
    necessary.
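
A back-of-the-envelope calculation shows why short queries match by chance. The sketch below estimates the expected number of exact matches to a peptide in a database of random sequence, assuming all 20 amino acids are equally likely. This is a deliberate simplification and is not the statistic BLAST reports; the database size used is a made-up round number, not the actual size of GenBank.

```python
def expected_exact_matches(query, database_residues, alphabet_size=20):
    """Expected chance exact matches to `query` in random sequence of the given size."""
    # Number of start positions at which the query could occur.
    positions = max(database_residues - len(query) + 1, 0)
    # Probability that one position matches every residue of the query.
    return positions * (1.0 / alphabet_size) ** len(query)

HYPOTHETICAL_DB_SIZE = 1_000_000_000  # assumed number of residues, for illustration

for name in ("TINWEE", "VLADIMIR"):
    print(name, expected_exact_matches(name, HYPOTHETICAL_DB_SIZE))
```

Under these assumptions the six-letter query is expected to occur by chance far more often than the eight-letter one, since each extra residue divides the expectation by 20.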

29
Database Searching: Cautionary Notes
  • Examples of chance matches: virtually any string
    or keyword can show matches to database
    entries. We are interested only in the real ones.
  • The same holds for searches of GenBank.
    Fortunately, we have statistical measures that
    indicate the quality of matches. However,
    matches that have low statistical significance
    sometimes have real biological significance.
    More about this will be taught later in the
    course.

30
Biological databases: Concluding remarks
  • 1. Biological databases are an invaluable
    resource in support of biological research.
  • 2. We can learn much about a particular molecule
    by searching databases and using the available
    analysis tools.
  • 3. A large number of databases are available for
    that task. Some databases are very general, some
    are more specialized, and some are very
    specialized. For best results we often need to
    access multiple databases.

31
Biological databases: Concluding remarks
  • 4. The major types of databases covered in this
    course are general nucleotide, general protein,
    structure, pathway, molecular interaction,
    protein motif, publication, and specialized
    databases.
  • 5. Common database search methods include keyword
    matching, sequence similarity, motif searching,
    and class searching.
  • 6. The problems with using biological databases
    include incomplete information, data spread over
    multiple databases, redundant information,
    various errors, sometimes incorrect links, and
    constant change.

32
Biological databases: Concluding remarks
  • 7. Database standards, nomenclature, and naming
    conventions are not clearly defined for many
    aspects of biological information. This makes
    information extraction more difficult.
  • 8. Retrieval systems help extract rich
    information from multiple databases. Examples
    include Entrez and SRS.
  • 9. Formulating queries is a serious issue in
    biological databases. Often the quality of
    results depends on the quality of the queries.

33
Biological databases: Concluding remarks
  • 10. Statistical measures indicate the quality of
    matches. Statistical and biological significance
    are often related. Sometimes, however, matches
    of real biological significance have low
    statistical scores.
  • 11. Access to biological databases is so
    important that today virtually every molecular
    biology project starts and ends with querying
    biological databases.

34
Biological databases: Summary of Today's lecture
  • Popular databases: KEGG, BIND, Pfam, PROSITE,
    PubMed
  • Data retrieval systems: Entrez, SRS
  • Database searching: capability, potential
    problems.
  • Statistics:
  • Protein families (> 5K)
  • Sequence patterns (> 1.5K)
  • Interactions (> 11K, or 110 x 110, which is
    relatively few)
  • Relatively small amount of data for function
    (e.g. Pathways < 200)