Essential Bioinformatics and Biocomputing (LSM2104: Section I) Biological Databases and Bioinformatics Software Prof. Chen Yu Zong Tel: 6874-6877 Email: csccyz@nus.edu.sg http://xin.cz3.nus.edu.sg Room 07-24, level 7, SOC1, NUS January 2003 - PowerPoint PPT Presentation

1
Essential Bioinformatics and Biocomputing
(LSM2104: Section I) Biological Databases
and Bioinformatics Software
Prof. Chen Yu Zong
Tel: 6874-6877
Email: csccyz@nus.edu.sg
http://xin.cz3.nus.edu.sg
Room 07-24, level 7, SOC1, NUS
January 2003
2
Essential Bioinformatics and Biocomputing
(LSM2104: Section I) Four lectures
  • Part 1: Biological databases
  • Lecture 2. Biological information and databases
  • Lecture 3. More databases, retrieval systems,
    and database searching
  • Part 2: Software
  • Lecture 4. Examples of the applications of
    bioinformatics software and basic principles
  • Lecture 5. Overview of bioinformatics software
3
Part 1: Biological databases
  • Part 1 outline
  • 1. Biological information and databases
  • Overview and definition, types of biological
    databases
  • 2. Popular databases, records, data format
  • GenBank, SwissProt, OMIM, PDB, KEGG, BIND, Pfam,
    PROSITE, PubMed
  • 3. Accessing biological databases, retrieval
    systems
  • Entrez, SRS
  • 4. Searching biological databases
  • Data quality, coverage, redundancy, errors
  • Textbook
  • T.K. Attwood and D.J. Parry-Smith, Introduction
    to Bioinformatics.
  • Biological databases: chapters 3 and 4

4
Biological Information
  • Cancer as an example
  • Genes: growth genes, tumor suppressor genes
  • Proteins: growth factors, enzymes, receptors
  • Pathways: cell death
  • Systems: immune system, blood supply

5
Biological Information
  • Nucleic acids
  • DNA: sequence, genes, gene products (proteins),
    mutation, gene coding, distribution patterns,
    motifs
  • Genomics: genome, gene structure and expression,
    genetic map, genetic disorder
  • RNA: sequence, secondary structure, 3D structure,
    interactions
  • Proteins
  • Protein: sequence, corresponding gene, secondary
    structure, 3D structure, function, motifs,
    homology, interactions
  • Proteomics: expression profile, proteins in
    disease processes, etc.
  • Ligands and drugs (inhibitors, activators,
    substrates, metabolites)

6
Biological Information
  • Pathways
  • Molecular networks, biological chain events,
    regulation, feedback, kinetic data
  • Function
  • Binding sites, interactions, molecular action
    (binding, chemical reaction, etc.)
  • Biological effect (signaling, transport,
    feedback, regulation, modification, etc.)
  • Functional relationship, protein families,
    motifs, and homologs

7
Biological databases
  • Purpose
  • To disseminate biological data and information
  • To provide biological data in computer-readable
    form
  • To allow analysis of biological data
  • A database needs to have at minimum a
    specific tool for searching and data extraction.
  • Web pages, books, journal articles, tables, text
    files, and spreadsheet files cannot be considered
    as databases
  • Reading materials
  • Baxevanis AD. The Molecular Biology Database
    Collection: 2002 update. Nucleic Acids Res. 2002
    Jan 1;30(1):1-12.

8
Biological databases
  • Lists of biological databases
  • INFOBIOGEN Catalog of Databases
    http://www.infobiogen.fr/services/dbcat/
  • Nucleic Acids Research Database Listing
  • http://nar.oupjournals.org/cgi/content/full/30/1/1/DC1
  • These lists serve as starting points for finding
    biological databases.
  • More than 500 databases have been catalogued to
    date, and those in the two listings satisfy
    minimal criteria for content, access, and
    quality.
  • Other sites can also serve as a starting point.

9
Biological databases
  • INFOBIOGEN Catalog of Databases
    Type of database      No. of records
    DNA                   87
    RNA                   29
    Protein               94
    Genomic               58
    Mapping               29
    Protein structure     18
    Literature            43
    Miscellaneous         153
    Total                 511
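As a quick check on the catalogue figures above, the following sketch adds up the per-type counts and expresses each as a share of the total. The counts are copied from the slide; the variable names are illustrative only.

```python
# Counts per database type, as listed on the slide.
counts = {
    "DNA": 87,
    "RNA": 29,
    "Protein": 94,
    "Genomic": 58,
    "Mapping": 29,
    "Protein structure": 18,
    "Literature": 43,
    "Miscellaneous": 153,
}

# The per-type counts should add up to the total given on the slide (511).
total = sum(counts.values())

# Percentage share of each type, rounded to one decimal place.
shares = {name: round(100 * n / total, 1) for name, n in counts.items()}

for name, n in counts.items():
    print(f"{name:<18} {n:>4} {shares[name]:>5}%")
print(f"{'Total':<18} {total:>4}")
```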

10
Biological databases - in Nucleic Acids Research
    Type of database                               No. of records
    Major Sequence Repositories                    7
    Comparative Genomics                           7
    Gene Expression                                20
    Gene Identification and Structure              30
    Genetic and Physical Maps                      10
    Genomic Databases                              48
    Intermolecular Interactions                    5
    Metabolic Pathways and Cellular Regulation     12
    Mutation Databases                             33
    Pathology                                      8
    Protein Databases                              50
    Protein Sequence Motifs                        18
    Proteome Resources                             7
    RNA Sequences                                  26
    Retrieval Systems and Database Structure       3
    Structure                                      32
    Transgenics                                    2
    Varied Biomedical Content                      18
    TOTAL                                          336
11
Literature databases: PubMed (MedLine)
  • 1. It contains entries for more than 11 million
    abstracts of scientific publications.
  • 2. It enables users to do keyword searches,
    provides links to a selection of full articles,
    and has text mining capabilities, e.g. it provides
    links to related articles and GenBank entries,
    among others.
  • 3. Searching PubMed efficiently requires some
    skill. For example, searching with the keyword
    "interleukin" returns 108,366 matches.

12
PubMed web-site (http://www3.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed)
13
PubMed Search (http://www3.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed)
Cancer treatment by targeting blood supply
  • Cancer growth depends on blood supply (why?) and
    thus requires the growth of new blood vessels:
    angiogenesis
  • Proteins involved in angiogenesis may be
    potential anticancer targets
  • You can find some of these targets by searching
    PubMed
  • Keywords "cancer angiogenesis enzyme drug"
    produce 856 entries
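The keyword search above can also be expressed as a URL. The sketch below builds such a URL, assuming NCBI's E-utilities `esearch` endpoint rather than the web form shown on the slide; the function name `pubmed_search_url` is hypothetical. It only constructs the URL and does not contact NCBI, and a live search today would return a different count from the 856 entries quoted.

```python
from urllib.parse import urlencode

# Assumption: the NCBI E-utilities "esearch" endpoint, not the query.fcgi
# page referenced on the slide.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def pubmed_search_url(keywords, max_results=20):
    """Build a PubMed search URL that requires all keywords (joined with AND)."""
    term = " AND ".join(keywords)
    params = {"db": "pubmed", "term": term, "retmax": max_results}
    return ESEARCH_URL + "?" + urlencode(params)


url = pubmed_search_url(["cancer", "angiogenesis", "enzyme", "drug"])
print(url)
```

Adding or removing keywords is the simplest way to narrow or widen the result set, which is the skill the PubMed slide refers to.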
14
Nucleic Acids databases
  • What information is in these databases?
  • DNA: sequence, genes, gene products (proteins),
    mutation, gene coding, distribution patterns,
    motifs
  • Genomics: genome, gene structure and expression,
    genetic map, genetic disorder
  • RNA: sequence, secondary structure, 3D structure,
    interactions

15
Nucleic Acids databases
  • DNA databases: GenBank, EMBL, DDBJ
  • 1. General purpose databases focusing on DNA
    sequences and their properties
  • 2. GenBank, EMBL-bank and DDBJ exchange data to
    ensure comprehensive worldwide coverage, and
    accession numbers are managed consistently
    among the three centers.
  • Reading materials
  • Textbook, chapter 4

16
DNA databases
  • GenBank database (http://www.ncbi.nih.gov/Genbank/)
  • Contains publicly available DNA sequences from
    more than 100,000 organisms.
  • Also contains derived protein sequences, and
    annotations describing biological, structural,
    and other relevant features.
  • Accessible through Entrez, NCBI's integrated
    retrieval system (studied later)
  • Sequence similarity search tools: BLAST (studied
    later)
  • EMBL nucleotide sequence database
    (http://www.ebi.ac.uk/embl/)
  • Contains nucleotide sequences collected from all
    public sources.
  • Accessible through the Sequence Retrieval System
    (SRS), which allows keyword searching (studied
    later)
  • Sequence similarity search tools: Blitz, Fasta,
    and BLAST (studied later)

17
DNA databases: GenBank Web page
18
DNA databases
  • An example from a GenBank flat file
  • Human alpha-lactalbumin gene
  • This protein is a complex of 2 proteins, A and B.
    In the absence of the B protein, the enzyme
    catalyzes the transfer of galactose from
    UDP-galactose to N-acetylglucosamine
    (cf. EC 2.4.1.90).

19
A GenBank entry: HEADER
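The header of a flat-file entry is a set of keyword lines: the keyword sits in the first 12 columns and the value follows, with continuation lines left blank in the keyword columns. A minimal sketch of reading such a header is given below. The record text is an illustrative stand-in, not the actual alpha-lactalbumin entry, and `parse_header` is a hypothetical helper.

```python
# Illustrative header only: the locus name, length, and date are made up.
SAMPLE_HEADER = """\
LOCUS       EXAMPLE01 3310 bp DNA PRI 01-JAN-2003
DEFINITION  Human alpha-lactalbumin gene, complete cds
            (illustrative record, not a real entry).
ACCESSION   EXAMPLE01
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
"""


def parse_header(text):
    """Map each keyword (columns 1-12) to its value, joining continuation lines."""
    fields = {}
    key = None
    for line in text.splitlines():
        label, value = line[:12].strip(), line[12:].strip()
        if label:
            # A new keyword or sub-keyword starts here.
            key = label
            fields[key] = value
        elif key and value:
            # Blank keyword columns: this line continues the previous value.
            fields[key] += " " + value
    return fields


header = parse_header(SAMPLE_HEADER)
for keyword, value in header.items():
    print(f"{keyword}: {value}")
```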
20
GenBank Entry: Links provided in the Header
  • MapViewer: find the gene position in the chromosome
  • Related Sequences: other entries related to this
    gene (or sequence)
  • OMIM: link to catalog of human genes and genetic
    disorders
  • Protein: retrieve protein record from GenPept
  • Medline and PubMed: literature abstracts related
    to this gene
  • Taxonomy: classification of organisms
  • UniGene: unified gene data
  • UniSTS: unified sequence tagged sites, marker
    and mapping data
  • LinkOut: links to publishers, aggregators,
    libraries, biological databases, sequence
    centers, and other Web resources
  • REFSEQ: reference sequence standards
  • Note: These links are representative. Other links
    may also be found in GenBank entries.

21
GenBank entry - FEATURES
22
GenBank Entry: Links provided in the Feature
section
  • LocusID: locus and display of genomic and mRNA
    sequences
  • MIM: link to OMIM description, other entries for
    this sequence
  • EC_number: link to the corresponding cataloged
    enzymes
  • Protein_id: retrieve protein record from GenPept
  • CD: conserved protein domain (SMART)
  • CDD: conserved protein domain (Pfam)
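In the flat file, the links listed above appear as qualifiers of the form `/name="value"` inside the FEATURES section. The sketch below pulls them out with a regular expression. The feature block is illustrative: the location, the identifiers, and the `db_xref` values are placeholders, and only the EC number is taken from the earlier slide.

```python
import re

# Illustrative FEATURES fragment; all identifiers are placeholders.
SAMPLE_FEATURES = """\
FEATURES             Location/Qualifiers
     CDS             join(10..50,80..120)
                     /gene="EXAMPLE"
                     /EC_number="2.4.1.90"
                     /protein_id="EXAMPLE_P1"
                     /db_xref="LocusID:0000"
                     /db_xref="MIM:000000"
"""

# A qualifier is a slash, a name, an equals sign, and a quoted value.
QUALIFIER = re.compile(r'/(\w+)="([^"]*)"')


def extract_qualifiers(text):
    """Return (name, value) pairs in the order they appear in the text."""
    return QUALIFIER.findall(text)


qualifiers = extract_qualifiers(SAMPLE_FEATURES)
for name, value in qualifiers:
    print(f"{name} -> {value}")
```

A list of pairs is used instead of a dictionary because a qualifier such as `db_xref` can occur more than once in one feature.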

23
Biological databases: GenBank - SEQUENCE
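The sequence section of a flat file starts at the ORIGIN line and ends at the `//` terminator, with position numbers and spaces breaking up the bases. A minimal sketch of recovering the plain sequence follows; the bases shown are made up for illustration and `read_origin` is a hypothetical helper.

```python
# Illustrative sequence block; the bases are not from a real entry.
SAMPLE_ORIGIN = """\
ORIGIN
        1 gatcctcaat gaaaccttat ttttcagtgc
       31 aagtgggcta
//
"""


def read_origin(text):
    """Concatenate the bases between the ORIGIN line and the // terminator."""
    pieces = []
    in_origin = False
    for line in text.splitlines():
        if line.startswith("ORIGIN"):
            in_origin = True
        elif line.startswith("//"):
            break
        elif in_origin:
            # Keep letters only; this drops position numbers and spaces.
            pieces.append("".join(ch for ch in line if ch.isalpha()))
    return "".join(pieces)


sequence = read_origin(SAMPLE_ORIGIN)
print(len(sequence), sequence)
```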
24
GenBank - NOTES
  • The majority of GenBank entries have a form
    similar to our example.
  • When accessing the database, note the following:
  • Some entries are huge, containing as many as
    30,000 lines. (NT_021877 Homo sapiens chromosome
    1 working draft sequence segment)
  • Some entries have contig information instead of
    sequence information. (NT_021877 Homo sapiens
    chromosome 1 working draft sequence segment)
  • Some entries are derived from cDNA sequences and
    thus represent putative genes/proteins. These
    should be used with caution. (AK007430. Mus
    musculus 10 d...gi12840976).
  • Some annotations are predicted using automated
    analysis. These should also be used with caution.
    (XM_131483 Mus musculus simi...gi20832685).

25
GenBank - Statistics
    Year    Base Pairs      Sequences
    1982    680338          606
    1992    101008486       78608
    2000    11101066288     10106023
    2001    15849921438     14976310
  • The data size is large and increases fast
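To put a number on how fast the data grows, the sketch below computes the growth factor between consecutive rows of the statistics above. The figures are copied from the slide; the variable names are illustrative.

```python
# (base pairs, sequences) for each row of the statistics, in slide order.
rows = [
    (680_338, 606),
    (101_008_486, 78_608),
    (11_101_066_288, 10_106_023),
    (15_849_921_438, 14_976_310),
]

# Growth factor from each row to the next, for base pairs and for sequences.
growth = [
    (bp / bp_prev, seq / seq_prev)
    for (bp_prev, seq_prev), (bp, seq) in zip(rows, rows[1:])
]

for bp_factor, seq_factor in growth:
    print(f"base pairs x{bp_factor:.2f}, sequences x{seq_factor:.2f}")
```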

26
Biological Databases
  • Database Searching
  • Databases must have methods for accessing and
    extracting the stored data.
  • The most basic search is keyword searching
  • A keyword can be any word that occurs somewhere
    in the database records. It can be the name of
    the gene or protein (e.g. lactalbumin), a species
    (e.g. Homo sapiens, human), a taxonomy term
    (e.g. primates), or a word from the reference
    title (e.g. cancer)
  • Other search keys include the entry ID number
    and the sequence
  • Databases typically have hyperlinks that provide
    access to additional information related to the
    entry from other sources.
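Keyword searching as described above amounts to checking whether a word occurs anywhere in a record. A minimal sketch over a few in-memory records follows. The records, field names, and IDs are invented for illustration; real retrieval systems index their records instead of scanning them one by one.

```python
# Hypothetical records; the IDs and field names are made up for illustration.
RECORDS = [
    {"id": "REC1", "definition": "Human alpha-lactalbumin gene",
     "organism": "Homo sapiens", "taxonomy": "Primates",
     "reference": "Structure of the human alpha-lactalbumin gene"},
    {"id": "REC2", "definition": "Mouse alpha-lactalbumin mRNA",
     "organism": "Mus musculus", "taxonomy": "Rodentia",
     "reference": "Milk protein expression in mouse"},
    {"id": "REC3", "definition": "Human tumor suppressor gene",
     "organism": "Homo sapiens", "taxonomy": "Primates",
     "reference": "Tumor suppressor mutations in cancer"},
]


def keyword_search(records, keyword):
    """Return the IDs of records in which the keyword occurs in any field."""
    needle = keyword.lower()
    return [
        record["id"]
        for record in records
        if any(needle in str(value).lower() for value in record.values())
    ]


print(keyword_search(RECORDS, "lactalbumin"))
print(keyword_search(RECORDS, "primates"))
print(keyword_search(RECORDS, "cancer"))
```

Because the ID field is searched along with the others, the same function also covers lookup by entry ID.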

27
Biological databases: OMIM, Online Mendelian
Inheritance in Man (http://www.ncbi.nlm.nih.gov/Omim/)
  • The OMIM database contains abstracts and texts
    describing genetic disorders to support genomics
    efforts and clinical genetics. It provides gene
    maps and maps of known disorders in tabular
    listing formats. It supports keyword search.
  • Hamosh A. et al. Online Mendelian Inheritance in
    Man (OMIM), a knowledge base of human genes and
    genetic disorders. Nucleic Acids Res. 2002;
    30: 52-55.

28
Biological databases: OMIM web-page
29
Biological databases: OMIM search engine
30
Biological databases: OMIM statistics
  • All Entries: 14088
  • Established Gene Locus: 10476
  • Phenotype Descriptions: 1194
  • Other Entries: 2418

31
Biological databases
  • Protein databases
  • SWISS-PROT (http://us.expasy.org/sprot/sprot-top.html)
    is a curated database focusing on a high level
    of annotation (sequence, function, structure,
    post-translational modifications, variants, etc.)
    of proteins.
  • TrEMBL is a computer-annotated supplement to
    SWISS-PROT
  • Reading materials: Textbook, chapter 3

32
Protein databases
  • What is in these databases?
  • Protein: sequence, corresponding gene, secondary
    structure, 3D structure, function, motifs,
    homology, interactions
  • Proteomics: expression profile, proteins in
    disease processes, etc.
  • Ligands and drugs (inhibitors, activators,
    substrates, metabolites)

33
Protein databases: SWISS-PROT
  • Notes
  • SWISS-PROT provides high-quality annotations and
    detailed information about sequence, structural,
    functional, and other properties of proteins.
  • It provides a rich set of links to other sources
    of information on SWISS-PROT entries.
    Unfortunately, some of the links will not work at
    all times, because the Web changes constantly.
  • It also provides a rich set of protein analysis
    tools.

34
SWISS-PROT web-page
35
SWISS-PROT entry P00709
37
SWISS-PROT entry P00709
38
SWISS-PROT entry P00709
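A SWISS-PROT flat-file entry is organised by two-letter line codes (ID, AC, DE, OS, and so on), with the value starting in column 6. The sketch below reads a few of these lines. The sample is abbreviated and written for illustration rather than copied verbatim from the P00709 entry, and `parse_line_codes` is a hypothetical helper.

```python
# Abbreviated, illustrative entry text; not a verbatim copy of P00709.
SAMPLE_ENTRY = """\
ID   LALBA_HUMAN     STANDARD;      PRT;   142 AA.
AC   P00709;
DE   Alpha-lactalbumin precursor
DE   (Lactose synthase B protein).
OS   Homo sapiens (Human).
//
"""


def parse_line_codes(text):
    """Group values by two-letter line code, joining repeated codes with a space."""
    grouped = {}
    for line in text.splitlines():
        if line.startswith("//"):
            break  # end-of-entry marker
        if not line.strip():
            continue
        code, value = line[:2], line[5:].strip()
        grouped.setdefault(code, []).append(value)
    return {code: " ".join(values) for code, values in grouped.items()}


entry = parse_line_codes(SAMPLE_ENTRY)
accession = entry["AC"].rstrip(";")
print(accession, "|", entry["DE"], "|", entry["OS"])
```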
39
Biological databases: Protein structure
database PDB (http://www.pdb.org)
  • More than 18,000 macromolecular structures of
    proteins, peptides, viruses, protein/nucleic
    acid complexes, nucleic acids, and
    carbohydrates.
  • Among the oldest databases: the first structure
    was deposited in 1972.
  • The number of newly deposited structures has
    been growing steadily (3298 in 2001, and 1486
    from Jan 1 to June 5, 2002).
  • Structures are determined mainly by X-ray
    diffraction and NMR.
  • It contains tools for keyword search,
    comprehensive visualization, and information
    extraction such as sequence, geometry, and
    structural neighbor details.
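The two deposit counts quoted above cover periods of different length, so a fair comparison needs the 2002 figure scaled to a full year. A minimal sketch of that arithmetic follows, assuming a constant deposit rate through 2002, which is a simplification.

```python
from datetime import date

deposits_2001 = 3298          # full year, from the slide
deposits_2002_partial = 1486  # Jan 1 to June 5, 2002, from the slide

# Number of days covered by the partial 2002 count, counting both end dates.
days_covered = (date(2002, 6, 5) - date(2002, 1, 1)).days + 1

# Scale the partial count to 365 days (assumes a constant deposit rate).
annualized_2002 = deposits_2002_partial * 365 / days_covered

print(f"days covered: {days_covered}")
print(f"annualized 2002 deposits: {annualized_2002:.0f} (2001: {deposits_2001})")
```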

40
Biological databases: PDB web-page (http://www.rcsb.org/pdb/)
41
Biological databases: A PDB entry (http://www.rcsb.org/pdb/)
42
Biological databases: PDB statistics
43
Biological databases: Summary of Today's lecture
  • Types of biological information, data and
    databases
  • Simple data retrieval method.
  • Popular databases: PubMed, GenBank, SwissProt,
    OMIM, PDB
  • Statistics
  • Large number of publications (MEDLINE: >12M since
    1960)
  • Large amount of data for sequence (DNA: >14M,
    Protein: >120K)
  • Fair amount of data for 3D structure (Protein:
    >14K, Nucleic acid: >1K)