Title: Essential Bioinformatics and Biocomputing (LSM2104: Section I) Biological Databases and Bioinformatics Software Prof. Chen Yu Zong Tel: 6874-6877 Email: csccyz@nus.edu.sg http://xin.cz3.nus.edu.sg Room 07-24, level 7, SOC1, NUS January 2003
1. Essential Bioinformatics and Biocomputing (LSM2104: Section I)
Biological Databases and Bioinformatics Software
Prof. Chen Yu Zong
Tel: 6874-6877
Email: csccyz@nus.edu.sg
http://xin.cz3.nus.edu.sg
Room 07-24, level 7, SOC1, NUS
January 2003
2. Essential Bioinformatics and Biocomputing (LSM2104: Section I)
Four lectures
Part 1: Biological databases
- Lecture 2. Biological information and databases
- Lecture 3. More databases, retrieval systems, and database searching
Part 2: Software
- Lecture 4. Examples of the applications of bioinformatics software and basic principles
- Lecture 5. Overview of bioinformatics software
3. Part 1: Biological databases
- Part 1 outline
- 1. Biological information and databases
- Overview and definition, types of biological databases
- 2. Popular databases, records, data format
- GenBank, SwissProt, OMIM, PDB, KEGG, BIND, Pfam, PROSITE, PubMed
- 3. Accessing biological databases, retrieval systems
- Entrez, SRS
- 4. Searching biological databases
- Data quality, coverage, redundancy, errors
- Textbook
- T.K. Attwood and D.J. Parry-Smith, Introduction to Bioinformatics.
- Biological databases: chapters 3 and 4
4. Biological Information
- Cancer as an example
- Genes
- Growth genes
- Tumor suppressor genes
- Proteins
- Growth Factors
- Enzymes
- Receptors
- Pathways
- Cell death
- Systems
- Immune system
- Blood supply
5. Biological Information
- Nucleic acids
- DNA sequence, genes, gene products (proteins), mutation, gene coding, distribution patterns, motifs
- Genomics: genome, gene structure and expression, genetic map, genetic disorder
- RNA sequence, secondary structure, 3D structure, interactions
- Proteins
- Protein sequence, corresponding gene, secondary structure, 3D structure, function, motifs, homology, interactions
- Proteomics: expression profile, proteins in disease processes, etc.
- Ligands and drugs (inhibitors, activators, substrates, metabolites)
6. Biological Information
- Pathways
- Molecular networks, biological chain events, regulation, feedback, kinetic data
- Function
- Binding sites, interactions, molecular action (binding, chemical reaction, etc.)
- Biological effect (signaling, transport, feedback, regulation, modification, etc.)
- Functional relationship, protein families, motifs, and homologs
7. Biological databases
- Purpose
- To disseminate biological data and information
- To provide biological data in computer-readable form
- To allow analysis of biological data
- A database needs to have, at minimum, a specific tool for searching and data extraction.
- Web pages, books, journal articles, tables, text files, and spreadsheet files cannot be considered databases.
- Reading materials
- Baxevanis AD. The Molecular Biology Database Collection: 2002 update. Nucleic Acids Res. 2002 Jan 1;30(1):1-12.
8. Biological databases
- Lists of biological databases
- INFOBIOGEN Catalog of Databases: http://www.infobiogen.fr/services/dbcat/
- Nucleic Acids Research Database Listing: http://nar.oupjournals.org/cgi/content/full/30/1/1/DC1
- These lists serve as starting points for finding biological databases.
- More than 500 databases have been catalogued to date, and those in the two listings satisfy minimal criteria for content, access, and quality.
- Other sites can also serve as starting points.
9. Biological databases
- INFOBIOGEN Catalog of Databases

Type of database     No. of records
DNA                  87
RNA                  29
Protein              94
Genomic              58
Mapping              29
Protein structure    18
Literature           43
Miscellaneous        153
Total                511
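
As a quick arithmetic check on the catalog figures above, the per-category counts can be totalled and turned into percentage shares. This is only a small sketch: the names `infobiogen_counts` and `share` are illustrative labels, and the numbers are the ones listed on the slide.

```python
# Per-category counts as listed for the INFOBIOGEN catalog on this slide.
infobiogen_counts = {
    "DNA": 87,
    "RNA": 29,
    "Protein": 94,
    "Genomic": 58,
    "Mapping": 29,
    "Protein structure": 18,
    "Literature": 43,
    "Miscellaneous": 153,
}

# Sum of all categories; this should match the total quoted on the slide.
total = sum(infobiogen_counts.values())


def share(category):
    """Return the percentage of catalogued entries that fall in one category."""
    return 100.0 * infobiogen_counts[category] / total


if __name__ == "__main__":
    print("Total:", total)
    for name, count in infobiogen_counts.items():
        print(f"{name:18s} {count:4d} {share(name):5.1f}%")
```

The same check can be repeated for any other listing by swapping in its counts.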
10. Biological databases in Nucleic Acids Research

Type of database                              No. of records
Major Sequence Repositories                   7
Comparative Genomics                          7
Gene Expression                               20
Gene Identification and Structure             30
Genetic and Physical Maps                     10
Genomic Databases                             48
Intermolecular Interactions                   5
Metabolic Pathways and Cellular Regulation    12
Mutation Databases                            33
Pathology                                     8
Protein Databases                             50
Protein Sequence Motifs                       18
Proteome Resources                            7
RNA Sequences                                 26
Retrieval Systems and Database Structure      3
Structure                                     32
Transgenics                                   2
Varied Biomedical Content                     18
TOTAL                                         336
11. Literature databases: PubMed (MEDLINE)
- 1. It contains entries for more than 11 million abstracts of scientific publications.
- 2. It enables users to do keyword searches, provides links to a selection of full articles, and has text mining capabilities, e.g. it provides links to related articles and GenBank entries, among others.
- 3. Searching PubMed efficiently requires some skill. For example, searching with the keyword "interleukin" returns 108,366 matches.
12. PubMed web-site (http://www3.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed)
13. PubMed Search (http://www3.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed)
Cancer treatment by targeting blood supply
Cancer growth depends on blood supply (why?) and thus requires the growth of new blood vessels: angiogenesis.
Proteins involved in angiogenesis may be potential anticancer targets.
You can find some of these targets by searching PubMed.
The keywords "cancer angiogenesis enzyme drug" produce 856 entries.
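
The keyword search above can also be written as a query against NCBI's E-utilities `esearch` service, the programmatic counterpart of the PubMed search page. The sketch below only builds the request URL and does not contact the server. The helper name `pubmed_search_url` is hypothetical, joining the keywords with AND is an assumption about how the slide's search was combined, and a search run today will not return the 856 entries quoted on the slide.

```python
from urllib.parse import urlencode

# Base address of NCBI's E-utilities esearch service.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def pubmed_search_url(keywords, max_results=20):
    """Build an esearch URL that ANDs the keywords together (hypothetical helper)."""
    term = " AND ".join(keywords)
    params = {"db": "pubmed", "term": term, "retmax": max_results}
    # urlencode escapes the spaces in the term so the URL is safe to request.
    return ESEARCH + "?" + urlencode(params)


if __name__ == "__main__":
    print(pubmed_search_url(["cancer", "angiogenesis", "enzyme", "drug"]))
```

Adding or removing a keyword narrows or widens the result set, which is the skill the previous slide refers to.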
14. Nucleic Acids databases
- What information is in these databases?
- DNA sequence, genes, gene products (proteins), mutation, gene coding, distribution patterns, motifs
- Genomics: genome, gene structure and expression, genetic map, genetic disorder
- RNA sequence, secondary structure, 3D structure, interactions
15. Nucleic Acids databases
- DNA databases: GenBank, EMBL, DDBJ
- 1. General purpose databases focusing on DNA sequences and their properties
- 2. GenBank, EMBL-Bank, and DDBJ exchange data to ensure comprehensive worldwide coverage, and accession numbers are managed consistently between the three centers.
- Reading materials
- Textbook, chapter 4
16. DNA databases
- GenBank database (http://www.ncbi.nih.gov/Genbank/)
- Contains publicly available DNA sequences from more than 100,000 organisms.
- Also contains derived protein sequences, and annotations describing biological, structural, and other relevant features.
- Accessible through Entrez, NCBI's integrated retrieval system (studied later)
- Sequence similarity search tools: BLAST (studied later)
- EMBL nucleotide sequence database (http://www.ebi.ac.uk/embl/)
- Contains nucleotide sequences collected from all public sources.
- Accessible through the Sequence Retrieval System (SRS), which allows keyword searching (studied later)
- Sequence similarity search tools: Blitz, FASTA, and BLAST (studied later)
17. DNA databases: GenBank Web page
18. DNA databases
- An example from a GenBank flat file
- Human alpha-lactalbumin gene
- This protein is a complex of 2 proteins, A and B. In the absence of the B protein, the enzyme catalyzes the transfer of galactose from UDP-galactose to N-acetylglucosamine (cf. EC 2.4.1.90).
19. A GenBank entry: HEADER
20. GenBank Entry: Links provided in the Header
- MapViewer: find the gene position in the chromosome
- Related Sequences: other entries related to this gene (or sequence)
- OMIM: link to the catalog of human genes and genetic disorders
- Protein: retrieve the protein record from GenPept
- Medline and PubMed: literature abstracts related to this gene
- Taxonomy: classification of organisms
- UniGene: unified gene data
- UniSTS: unified sequence tagged sites, marker and mapping data
- LinkOut: links to publishers, aggregators, libraries, biological databases, sequence centers, and other Web resources
- REFSEQ: reference sequence standards
- Note: These links are representative. Other links may also be found in GenBank entries.
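
The header of a GenBank flat file has a fixed layout: a keyword occupies the first 12 columns, its value starts at column 13, and continuation lines are indented by 12 spaces. A minimal sketch of reading such a header follows. The function name `parse_genbank_header` is hypothetical, and the sample record is a made-up placeholder (its locus name and accession are not real entries), not the alpha-lactalbumin record shown on the slides.

```python
# Made-up, abridged record that only illustrates the column layout.
SAMPLE = """LOCUS       EXAMPLE01     120 bp    DNA     linear   PRI 01-JAN-2003
DEFINITION  Hypothetical example gene, used only to illustrate
            the flat-file layout.
ACCESSION   X00000
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
FEATURES             Location/Qualifiers
"""


def parse_genbank_header(text):
    """Collect header fields (keyword -> value) from a GenBank-style flat file."""
    fields = {}
    current = None
    for line in text.splitlines():
        if line.startswith("FEATURES") or line.startswith("ORIGIN"):
            break  # the header ends where the feature table or sequence begins
        keyword, value = line[:12].strip(), line[12:].strip()
        if keyword:
            # A new keyword (or indented sub-keyword such as ORGANISM) starts a field.
            current = keyword
            fields[current] = value
        elif current is not None and value:
            # A line with a blank keyword column continues the previous field.
            fields[current] += " " + value
    return fields


if __name__ == "__main__":
    for key, value in parse_genbank_header(SAMPLE).items():
        print(f"{key}: {value}")
```

For real work an established parser is the safer choice; the sketch is only meant to show how the header fields map onto columns.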
21. GenBank entry: FEATURES
22. GenBank Entry: Links provided in the Feature section
- LocusID: locus and display of genomic and mRNA sequences
- MIM: link to OMIM description, other entries for this sequence
- EC_number: link to the corresponding cataloged enzymes
- Protein_id: retrieve the protein record from GenPept
- CD: conserved protein domain (SMART)
- CDD: conserved protein domain (Pfam)
23. Biological databases: GenBank - SEQUENCE
24. GenBank - NOTES
- The majority of GenBank entries have a form similar to our example.
- When accessing the database, note the following:
- Some entries are huge, containing as many as 30,000 lines. (NT_021877 Homo sapiens chromosome 1 working draft sequence segment)
- Some entries have contig information instead of sequence information. (NT_021877 Homo sapiens chromosome 1 working draft sequence segment)
- Some entries are derived from cDNA sequences and thus represent putative genes/proteins. These should be used with caution. (AK007430. Mus musculus 10 d...gi12840976).
- Some annotations are predicted using automated analysis. These should also be used with caution. (XM_131483 Mus musculus simi...gi20832685).
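
Two of the cautions above can be recognised from the accession prefix alone, because RefSeq uses `NT_` for genomic contigs and `XM_` for model mRNAs predicted by automated analysis. The sketch below flags such accessions. The names `REFSEQ_PREFIXES` and `caution_note` are hypothetical, and the table is deliberately limited to the two prefixes that appear on this slide. A cDNA-derived entry such as AK007430 cannot be spotted this way, so the function returns `None` for it.

```python
# RefSeq accession prefixes that correspond to the examples on this slide.
REFSEQ_PREFIXES = {
    "NT_": "genomic contig (may list contig pieces instead of sequence)",
    "XM_": "model mRNA predicted by automated analysis",
}


def caution_note(accession):
    """Return a short warning for an accession, or None if no prefix rule applies."""
    for prefix, note in REFSEQ_PREFIXES.items():
        if accession.upper().startswith(prefix):
            return note
    return None


if __name__ == "__main__":
    for acc in ["NT_021877", "AK007430", "XM_131483"]:
        print(acc, "->", caution_note(acc))
```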
25. GenBank - Statistics

Year    Base Pairs      Sequences
1982    680338          606
1992    101008486       78608
2000    11101066288     10106023
2001    15849921438     14976310

- Data size is large and increases fast
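
To put a number on "increases fast", the rows of the statistics table can be compared directly. The sketch below computes the fold increase in base pairs from one listed row to the next and the average sequence length in each row. The names `genbank_rows`, `fold_increase`, and `mean_length` are illustrative, and the figures are the ones shown on the slide.

```python
# (base pairs, sequences) in the order the rows appear on the slide.
genbank_rows = [
    (680338, 606),
    (101008486, 78608),
    (11101066288, 10106023),
    (15849921438, 14976310),
]


def fold_increase(rows):
    """Ratio of base pairs in each row to the row listed before it."""
    return [later[0] / earlier[0] for earlier, later in zip(rows, rows[1:])]


def mean_length(rows):
    """Average number of base pairs per sequence in each row."""
    return [base_pairs / sequences for base_pairs, sequences in rows]


if __name__ == "__main__":
    print("Fold increase between rows:", [round(f, 2) for f in fold_increase(genbank_rows)])
    print("Mean sequence length:", [round(m) for m in mean_length(genbank_rows)])
```

The rows are not evenly spaced in time, so the ratios describe growth between listed snapshots and are not annual rates.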
26. Biological Databases
- Database Searching
- Databases must have methods for accessing and extracting the data stored.
- The most basic search is keyword searching.
- Keywords can be any word that occurs somewhere in the database records. It can be the name of the gene or protein (e.g. lactalbumin), a species (e.g. Homo sapiens, human), a taxonomy term (e.g. primates), or a word from the reference title (e.g. cancer).
- Other search keys include the entry ID number and the sequence.
- Databases typically have hyperlinks that provide access to additional information related to the entry from other sources.
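
Keyword searching as described above amounts to checking whether a word occurs in any field of a record. A minimal sketch of that idea follows. The `records` list is a hypothetical two-entry stand-in for a database, and `keyword_search` is an illustrative helper, not the search method used by any of the real retrieval systems.

```python
# Hypothetical miniature "database": each record is a dict of text fields.
records = [
    {"id": "REC1", "name": "alpha-lactalbumin", "species": "Homo sapiens (human)",
     "title": "Structure of the lactalbumin gene"},
    {"id": "REC2", "name": "lysozyme C", "species": "Mus musculus (mouse)",
     "title": "Lysozyme expression in cancer tissue"},
]


def keyword_search(records, keyword):
    """Return IDs of records where the keyword occurs in any field (case-insensitive)."""
    needle = keyword.lower()
    return [rec["id"] for rec in records
            if any(needle in str(value).lower() for value in rec.values())]


if __name__ == "__main__":
    for word in ["lactalbumin", "human", "cancer", "primates"]:
        print(word, "->", keyword_search(records, word))
```

Real systems index the records in advance so they do not have to scan every entry, but the matching rule is the same in spirit.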
27. Biological databases: OMIM, Online Mendelian Inheritance in Man (http://www.ncbi.nlm.nih.gov/Omim/)
- The OMIM database contains abstracts and texts describing genetic disorders to support genomics efforts and clinical genetics. It provides gene maps and known disorder maps in tabular listing formats, and supports keyword search.
- Hamosh A. et al. Online Mendelian Inheritance in Man (OMIM), a knowledge base of human genes and genetic disorders. Nucleic Acids Res. 2002; 30: 52-55.
28. Biological databases: OMIM web-page
29. Biological databases: OMIM search engine
30. Biological databases: OMIM statistics
- All Entries: 14088
- Established Gene Locus: 10476
- Phenotype Descriptions: 1194
- Other Entries: 2418
31. Biological databases
- Protein databases
- SWISS-PROT (http://us.expasy.org/sprot/sprot-top.html) is a curated database focusing on a high level of annotation (sequence, function, structure, post-translational modifications, variants, etc.) of proteins.
- TrEMBL is a computer-annotated supplement to SWISS-PROT.
- Reading materials: Textbook, chapter 3
32. Protein databases
- What information is in these databases?
- Protein sequence, corresponding gene, secondary structure, 3D structure, function, motifs, homology, interactions
- Proteomics: expression profile, proteins in disease processes, etc.
- Ligands and drugs (inhibitors, activators, substrates, metabolites)
33. Protein databases: SWISS-PROT
- Notes
- SWISS-PROT provides high-quality annotations and detailed information about the sequence, structural, functional, and other properties of proteins.
- It provides a rich set of links to other sources of information on SWISS-PROT entries. Unfortunately, some of the links will not work at all times, because the Web changes constantly.
- It also provides a rich set of protein analysis tools.
34. SWISS-PROT web-page
35. SWISS-PROT entry P00709
37. SWISS-PROT entry P00709
38. SWISS-PROT entry P00709
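
SWISS-PROT entries are distributed as flat files in which every line starts with a two-letter line code (for example ID, AC, DE, OS), followed by three spaces and the data, and an entry ends with a `//` line. The sketch below groups the lines of an entry by line code. The function name `parse_line_codes` is hypothetical, and the sample text is a made-up placeholder that only illustrates the layout. It is not the content of entry P00709.

```python
# Made-up placeholder entry; the identifier and accession are not real.
SAMPLE_ENTRY = """ID   EXAMPLE_HUMAN   STANDARD;      PRT;   142 AA.
AC   XXXXXX;
DE   Hypothetical example protein.
OS   Homo sapiens (Human).
//
"""


def parse_line_codes(text):
    """Group the data of a SWISS-PROT-style entry by its two-letter line code."""
    entry = {}
    for line in text.splitlines():
        if line.startswith("//"):
            break  # "//" terminates an entry
        code, data = line[:2], line[5:].strip()
        if code.strip():
            # Several lines may share a code, so keep them in a list.
            entry.setdefault(code, []).append(data)
    return entry


if __name__ == "__main__":
    for code, lines in parse_line_codes(SAMPLE_ENTRY).items():
        print(code, lines)
```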
39. Biological databases: Protein structure database PDB (http://www.pdb.org)
- More than 18,000 macromolecular structures of proteins, peptides, viruses, protein/nucleic acid complexes, nucleic acids, and carbohydrates.
- Among the oldest databases: the first structure was deposited in 1972.
- The number of newly deposited structures has been growing steadily (3298 in 2001, and 1486 from Jan 1 to June 5, 2002).
- Structures are determined mainly by X-ray diffraction and NMR.
- It contains tools for keyword search, comprehensive visualization, and information extraction, such as sequence, geometry, and structural neighbor details.
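
Geometry extraction from a PDB entry relies on the fixed-column ATOM records, in which the x, y, and z coordinates occupy columns 31-38, 39-46, and 47-54. The sketch below parses two ATOM lines and computes the distance between the atoms. The helper names `parse_atom` and `distance` are hypothetical, and the two sample lines are illustrative values that are not tied to any entry shown on the slides.

```python
import math

# Two illustrative ATOM lines in the fixed-column PDB layout (sample values only).
ATOM_LINES = [
    "ATOM      1  N   MET A   1      27.340  24.430   2.614",
    "ATOM      2  CA  MET A   1      26.266  25.413   2.842",
]


def parse_atom(line):
    """Extract atom name, residue, chain, and coordinates from one ATOM record."""
    return {
        "name": line[12:16].strip(),      # columns 13-16
        "residue": line[17:20].strip(),   # columns 18-20
        "chain": line[21],                # column 22
        "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
    }


def distance(a, b):
    """Euclidean distance in angstroms between two parsed atoms."""
    return math.dist(a["xyz"], b["xyz"])


if __name__ == "__main__":
    n_atom, ca_atom = (parse_atom(line) for line in ATOM_LINES)
    print(f"{n_atom['name']}-{ca_atom['name']} distance: {distance(n_atom, ca_atom):.2f} A")
```

Slicing by column position, and not splitting on whitespace, matters here because adjacent fields in real PDB files can run together without a separating space.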
40. Biological databases: PDB web-page (http://www.rcsb.org/pdb/)
41. Biological databases: A PDB entry (http://www.rcsb.org/pdb/)
42. Biological databases: PDB statistics
43. Biological databases: Summary of Today's lecture
- Types of biological information, data, and databases
- Simple data retrieval method
- Popular databases: PubMed, GenBank, SwissProt, OMIM, PDB
- Statistics
- Large number of publications (MEDLINE >12M since 1960)
- Large amount of data for sequence (DNA >14M, Protein >120K)
- Fair amount of data for 3D structure (Protein >14K, Nucleic acid >1K)