Title: Biological Information and Biological Databases
1Biological Information and Biological Databases
- Meena K Sakharkar
- Bioinformatics Centre
- National University of Singapore
2Biological Information
3Nature of Life Science Information
- Descriptive
- Classification and Nomenclatural
- Observational and Phenomenological
- Experimental
- Deduced/Computed
- Simulated?
- Theoretical?
4Descriptive
5Classify and Give Names
- Classification and Nomenclature
- Linnaeus - binomial nomenclature
- Group into kingdoms, phyla, classes, orders, families, genera, species, subspecies, strains, etc.
- Associate descriptions to these classification schema, and classify according to description, etc.
6Observational/Phenomenological
- Like descriptive, yet more active
- Observe a lot of biological phenomena
- Charles Darwin
- Gregor Mendel to McClintock
- Start to do some experiments
7Experimental
- From dissections to complex genetic engineering
experiments
8BioInformatics
- Deduced/Computed
- Simulated?
- Theoretical?
9What is BioInformatics?
- Many related terms and buzzwords
- A multiplicity of names
- bioinformatics
- biocomputing
- biological computing
- computational biology
- computational genomics
- biological data mining
10Overview of the challenges of Molecular Biology
Computing
- The huge dataset problem
- automated DNA sequencers
- the Human Genome Project
- bulk sequencing of cDNAs (ESTs)
11Human Genome Project
- What is the Human Genome Project?
- 15-year effort formally begun in October 1990, coordinated by the U.S. Department of Energy and the National Institutes of Health.
- identify all the estimated 80,000 genes in human DNA,
- determine the sequences of the 3 billion chemical bases that make up human DNA,
- store this information in databases,
- develop tools for data analysis, and
- address the ethical, legal, and social issues
(ELSI) that may arise from the project.
12 - Who is head of the U.S. Human Genome Project?
- The DOE Human Genome Program is directed by Ari Patrinos, and Francis Collins directs the NIH Human Genome Program.
- Ari Patrinos also heads the Department of Energy Office of Biological and Environmental Research.
13 - What are the comparative genome sizes of humans and other organisms being studied?
- If compiled in books, the data would fill an estimated 200 volumes the size of a Manhattan telephone book (at 1000 pages each), and reading it would require 26 years working around the clock.
15Informatics Data Collection and Interpretation
- HUMAN GENETIC DIVERSITY
- The Ultimate Human Genetic Database
- Any two individuals differ in about 3 x 10^6 bases (0.1%).
- The population is now about 5 x 10^9.
- A catalog of all sequence differences would require 15 x 10^15 entries.
- This catalog may be needed to find the rarest or most complex disease genes.
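The catalog size quoted above follows from multiplying the per-individual differences by the population. A short worked check, using only the figures given on the slides (the variable names are illustrative, and treating every individual's differences as separate entries is the slide's own simplifying assumption):

```python
# Figures taken from the slides; variable names are illustrative only.
differences_per_individual = 3 * 10**6  # bases that differ between two people
genome_size = 3 * 10**9                 # chemical bases in human DNA
population = 5 * 10**9                  # people

# Fraction of the genome that differs between two individuals.
fraction_differing = differences_per_individual / genome_size

# Upper-bound catalog size: one entry per difference per person.
catalog_entries = differences_per_individual * population

print(f"{fraction_differing:.1%}")  # 0.1%
print(f"{catalog_entries:.1e}")     # 1.5e+16, i.e. 15 x 10^15
```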
16Databases
17Basic Terminology
- What is a nucleotide/protein sequence database and databank?
- Database: a collection of nucleotide/protein sequences and their associated annotations.
- Databank: the groups which collect, compile, maintain and distribute the database.
18Fundamental Dogma
19Work from the Code of Life
21Deduced and Computed Information in the Era of
Computational Biology
23Databases
- What are the different kinds of databases and
their formats?
- Nucleic acid sequence: EMBL at EBI, GenBank at NCBI, DDBJ in Japan.
- Protein sequence: SWISS-PROT, NBRF (PIR).
24Database
- Protein structure databases
- PDB
- Information on the structural data for the proteins/nucleic acids whose 3-D structure has been solved by X-ray crystallography/NMR.
- PDB database
- NRL_3D database
- NRL_3D is a sequence-structure database.
- Can be used in conjunction with PIR.
- PDB with PIR.
25GenBank Entry
26EMBL Entry
27SwissProt Entry
28Other databases
- Genome Databases
- GDB (Genome Database)
- OMIM
- Pattern Databases
- Prosite
- TFD
29Usage of databases
- Annotation searches: keywords, authors, features.
- What is the protein sequence for human insulin?
- What does the 3D structure of calmodulin look like?
- What is the genetic location of the cystic fibrosis gene?
- List all introns in rat.
- Homology searches
- Is there any protein sequence that is similar to mine?
- Is this gene known in any other species?
- Has someone already cloned this sequence?
30Usage of databases
- Pattern searches
- Does my sequence contain any known motif (that can give me a clue about the function)?
- Which known sequences contain this motif?
- Is any part of my sequence recognised by a transcription factor?
- List all known start, splice and stop signals in my genomic sequence.
- Prediction: use the database as a knowledge base
- What may the structure of my protein be?
- Secondary structure prediction
- Modeling by homology
- What is the gene structure of my genomic sequence?
- Which parts of my protein have a high antigenicity?
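Motif questions like those above are often answered by scanning a sequence with a PROSITE-style pattern. The following is a minimal sketch, assuming only the basic pattern syntax (x for any residue, [..] for allowed residues, {..} for excluded residues, (n) or (n,m) for repeats, and < or > anchors at the ends of the pattern). The function names `prosite_to_regex` and `find_motif` and the test sequence are hypothetical, chosen for illustration.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a basic PROSITE-style pattern into a Python regex string."""
    pattern = pattern.rstrip(".")  # PROSITE patterns may end with a period
    parts = []
    for element in pattern.split("-"):
        element = element.replace("{", "[^").replace("}", "]")  # {P} -> [^P]
        element = element.replace("(", "{").replace(")", "}")   # x(2,4) -> x{2,4}
        element = element.replace("x", ".")                     # x -> any residue
        element = element.replace("<", "^").replace(">", "$")   # terminal anchors
        parts.append(element)
    return "".join(parts)


def find_motif(pattern: str, sequence: str):
    """Return (1-based start, matched text) for non-overlapping matches."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start() + 1, m.group()) for m in regex.finditer(sequence)]


# N-glycosylation site pattern; the sequence below is made up.
hits = find_motif("N-{P}-[ST]-{P}", "AANGSAANPTA")
print(hits)  # [(3, 'NGSA')]
```

The second candidate site (NPTA) is rejected because proline follows the asparagine, which the {P} element excludes. Overlapping matches are not reported by this sketch.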
31Usage of Databases
- Comparisons
- Gene Families
- Phylogenetic Trees
32GenBank Growth Chart
- Chart of bases against year
33Evolutionary basis of Alignment
- Enable the researcher to determine if two sequences display sufficient similarity to justify the inference of homology.
- Similarity is an observable quantity that may be expressed as, say, percent identity or some other measure.
- Homology is a conclusion drawn from these data that the two genes share a common evolutionary history.
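Percent identity is one simple way to put a number on similarity. A minimal sketch, assuming the two sequences have already been aligned, that "-" marks a gap, and that columns containing a gap are left out of the count (other conventions for the denominator exist). The function name and the example sequences are illustrative only.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent of ungapped alignment columns in which both residues match."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # skip columns where either sequence has a gap
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# Made-up alignment: 10 ungapped columns, 9 of them identical.
identity = percent_identity("MSTKYSAS-AE", "MSTRYSASGAE")
print(identity)  # 90.0
```

A high identity value is still only a measure of similarity. Whether it justifies inferring homology is a separate judgment.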
34Sequence Formats
35Fasta Format
>SANJAY REFORMAT of SANJAY.seq check 8826 from 1 to 573 March 12, 1998
MASSSVPPMITEEEARFEAEVSAVESWWRTDRFRLTRRPYSARDVVSLRGTLHHSYASDQ
MAKKLWRTLKSHQSAGTASRTFGALDPVQVTMMAKHLDTIYVSGWQCSSTHTATNEPGPD
LADYPYNTVPNKVEHLFFAQLYHDRKQHEARVSMTREQRAKTPYVDYLRPIIADGDTGFG
GATATVKLCKLFVERGAAGVHIEDQSSVTKKCGHMAGKVLVAVSEHINRLVAARLQFDVM
GVETVLVARTDAVAATLIQSNVDLRDHQFILGATNPDFKRRSLAAVLSAAMAAGKTGAVL
QAIEDDWLSRAGLMTFSDAVINGINRQLPEYEKQRRLNEWAAATEYSKCVSNEQGREIAE
RLGAGEIFWDWDIARTREGFYRFRGSVEAAVVRGRAFAPHADLIWMETSSPDLVECGKFA
QGMKASHPEIMLAYNLSPSFNWDAAGMTDEEMRDFIPRIAKMGFCWQFITLGGFHADALV
TDTFAREFAKQGMLAYVERIQREERNNGVDTLAHQKWSGANYYDRYLKTVQGGISSTAAM
GKGVTEEQFKEESRTGTRGLDRGGITVNAKSRL
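A FASTA record is a header line beginning with ">" followed by one or more lines of sequence. A minimal sketch of reading such records, assuming plain text input. The function name `parse_fasta` and the two short example records are hypothetical.

```python
def parse_fasta(text: str):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]  # drop the leading '>'
            chunks = []
        elif header is not None:
            chunks.append(line.replace(" ", ""))  # join wrapped sequence lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical two-record input for illustration.
example = """>seq1 first example record
MASSSVPPMI
TEEEARFEAE
>seq2 second example record
MSTKYSASAE
"""
records = parse_fasta(example)
for name, seq in records:
    print(name, len(seq))
```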
36GCG Format
ckl.seq  Length: 473  September 15, 1999 12:25  Type: P  Check: 8103  ..

  1  MSTKYSASAE SASSYRRTFG SGLGSSIFAG HGSSGSSGSS RLTSRVYEVT
 51  KSSASPHFSS HRASGSFGGG SVVRSYAGLG EKLDFNLADA INQDFLNTRT
101  NEKAELQHLN DRFASYIEKV RFLEQQNSAL TVEIERLRGR EPTRIAELYE
151  EEMRELRGQV EALTNQRSRV EIERDNLVDD LQKLKLRLQE EIHQKEEAEN
201  NLSAFRADVD AATLARLDLE RRIEGLHEEI AFLRKIHEEE IRELQNQMQE
251  SQVQIQMDMS KPDLTAALRD IRLQYEAIAA KNISEAEDWY KSKVSDLNQA
301  VNKNNEALRE AKQETMQFRH QLQSYTCEID SLKGTNESLR RQMSEDGGAA
351  GREAGGYQDT IARLEAEIAK MKDEMARHLR EYQDLLNVKM ALDVEIATYR
401  KLLEGEESRI SLPVQSFSSL SFRESSPEQH HHQQQQPQRS SEVHSKKTVL
451  IKTIETRDGE VVSESTQHQQ DVM
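In a GCG-style file the sequence follows a header line that ends in "..", and each sequence line carries a position number and space-separated blocks of ten residues. A minimal sketch of recovering the plain sequence, assuming that layout and a "Length" field in the header. The function name `parse_gcg` and the toy record (with the checksum field left out) are hypothetical, and gap symbols are not handled.

```python
import re


def parse_gcg(text: str) -> dict:
    """Extract the declared length and the residues from GCG-style text."""
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if line.rstrip().endswith(".."):  # header line ends with the divider
            header = line
            body = lines[i + 1:]
            break
    else:
        raise ValueError("no header line ending in '..' was found")
    length_match = re.search(r"Length:?\s*(\d+)", header)
    # Keep letters only, which drops position numbers and spacing.
    sequence = "".join(re.findall(r"[A-Za-z]", " ".join(body)))
    return {
        "length": int(length_match.group(1)) if length_match else None,
        "sequence": sequence,
    }


# Toy record for illustration; not a complete GCG file.
toy = """example.seq  Length: 23  Type: P  ..

  1  IKTIETRDGE VVSESTQHQQ DVM
"""
parsed = parse_gcg(toy)
print(parsed["length"], parsed["sequence"])
```

Comparing the declared length with the number of residues read is a cheap sanity check on the parse.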
37Taxonomy Database
53Blast Results
54Examples of the New Biology
- 1. Full genome-genome comparisons
- 2. Rapid assessment of polymorphic genetic variations
- 3. Complete construction of orthologous or paralogous groups of genes
- 4. Structure determination of large macromolecular assemblies/complexes
- 5. Dynamic simulation of realistic oligomeric systems
- 6. Rapid structural/topological clustering of proteins
- 7. Prediction of unknown molecular structures; protein folding
- 8. Computer simulation of membrane structure and dynamic function
- 9. Simulation of genetic networks and the sensitivity of these pathways to component stoichiometry and kinetics
- 10. Integration of observations across scales of vastly different dimensions and organization to yield realistic environmental models for basic biology and societal needs
55Theoretical?
- The day will dawn when we will have sufficient
information to understand how basic life
functions are integrated into a living cell, and
how such cells intercommunicate and interoperate
to function as a living whole. Then maybe, we can
start talking about theoretical biology
56Categories of BioDbs - by domain of information
- DNA
- RNA
- Protein
- Genomic Mapping
- Pathways
- Structure
- Bibliographic
- Biochemical/Molecular/Miscellaneous
57Other categories
- By category of species
- By families or superfamilies of molecules
- etc
- Demo
- http://www.infobiogen.fr/services/dbcat/
58Demonstration of BioDatabases
- The majority of life science databases are online, accessible on the Web via the Internet
- Catalogs of databases are available
- Need for a registry to keep track and offer quality control