Biological Information and Biological Databases - PowerPoint PPT Presentation

1
Biological Information and Biological Databases
  • Meena K Sakharkar
  • Bioinformatics Centre
  • National University of Singapore

2
Biological Information
3
Nature of Life Science Information
  • Descriptive
  • Classification and Nomenclatural
  • Observational and Phenomenological
  • Experimental
  • Deduced/Computed
  • Simulated?
  • Theoretical?

4
Descriptive
5
Classify and Give Names
  • Classification and Nomenclature
  • Linnaeus - binomial nomenclature
  • Group into kingdoms, phyla, classes, orders,
    families, genera, species, subspecies, strains,
    etc
  • Associate descriptions to these classification
    schema, and classify according to description etc

6
Observational/Phenomenological
  • Like descriptive, yet more active
  • Observe a lot of biological phenomena
  • Charles Darwin
  • Gregor Mendel to McClintock
  • Start to do some experiments

7
Experimental
  • From dissections to complex genetic engineering
    experiments

8
BioInformatics
  • Deduced/Computed
  • Simulated?
  • Theoretical?

9
What is BioInformatics?
  • Many related terms and buzzwords
  • A multiplicity of names
  • bioinformatics
  • biocomputing
  • biological computing
  • computational biology
  • computational genomics
  • biological data mining

10
Overview of the challenges of Molecular Biology
Computing
  • The huge dataset problem
  • automated DNA sequencers
  • the Human Genome Project
  • bulk sequencing of cDNAs (ESTs)

11
Human Genome Project
  • What is the Human Genome Project?
  • 15-year effort formally begun in October 1990,
    coordinated by the U.S. Department of Energy and
    the National Institutes of Health.
  • identify all the estimated 80,000 genes in human
    DNA,
  • determine the sequences of the 3 billion chemical
    bases that make up human DNA,
  • store this information in databases,
  • develop tools for data analysis, and
  • address the ethical, legal, and social issues
    (ELSI) that may arise from the project.

12
  • Who is head of the U.S. Human Genome Project?
  • The DOE Human Genome Program is directed by Ari
    Patrinos, and Francis Collins directs the NIH
    Human Genome Program.
  • Ari Patrinos also heads the Department of Energy
    Office of Biological and Environmental Research.

13
  • What are the comparative genome sizes of humans
    and other organisms being studied?

If compiled in books, the data would fill an
estimated 200 volumes the size of a Manhattan
telephone book (at 1000 pages each), and reading
it would require 26 years working around the
clock.
15
Informatics Data Collection and Interpretation
  • HUMAN GENETIC DIVERSITY
  • The Ultimate Human Genetic Database
  • Any two individuals differ in about 3 x 10^6 bases
    (0.1%).
  • The population is now about 5 x 10^9.
  • A catalog of all sequence differences would
    require 15 x 10^15 entries.
  • This catalog may be needed to find the rarest or
    most complex disease genes.
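
The arithmetic behind these estimates can be checked with a short sketch. The figures are the slide's own rough estimates (the 3 billion base genome size comes from the Human Genome Project slide); the function and constant names are illustrative assumptions, not part of the presentation.

```python
# Rough, dated estimates taken from the slides; not current figures.
GENOME_SIZE = 3 * 10**9        # bases in human DNA
DIFFERING_BASES = 3 * 10**6    # bases that differ between any two individuals
POPULATION = 5 * 10**9         # people


def fraction_differing(differing_bases, genome_size):
    """Fraction of the genome that differs between two individuals."""
    return differing_bases / genome_size


def catalog_entries(differing_bases, population):
    """Entries needed if every individual's differences are stored separately."""
    return differing_bases * population


frac = fraction_differing(DIFFERING_BASES, GENOME_SIZE)
entries = catalog_entries(DIFFERING_BASES, POPULATION)

print(f"Fraction differing: {frac:.1%}")    # 0.1%
print(f"Catalog entries:    {entries:.1e}")  # 1.5e+16, i.e. 15 x 10^15
```

This is a naive upper bound: it assumes no differences are shared between individuals, which is why the catalog comes out so large.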

16
Databases
17
Basic Terminology
  • What is a nucleotide/protein sequence database,
    and what is a databank?
  • A database is a collection of nucleotide/protein
    sequences and their associated annotations.
  • Databanks are the groups which collect, compile,
    maintain and distribute the databases.

18
Fundamental Dogma
19
Work from the Code of Life
21
Deduced and Computed Information in the Era of
Computational Biology
23
Databases
  • What are the different kinds of databases and
    their formats?

  • Nucleic acid sequence: EMBL at EBI, GenBank at
    NCBI, DDBJ in Japan
  • Protein sequence: SWISS-PROT, NBRF (PIR)
24
Database
  • Protein structure databases
  • PDB
  • Structural data for proteins/nucleic acids whose
    3-D structure has been solved by X-ray
    crystallography/NMR
  • PDB database
  • NRL 3D Database
  • NRL_3D is a sequence-structure database.
  • Can be used in conjunction with PIR.
  • PDB with PIR.

25
GenBank Entry
26
EMBL Entry
27
SwissProt Entry
28
Other databases
  • Genome Databases
  • GDB Genome Data Bank
  • OMIM
  • Pattern Databases
  • Prosite
  • TFD

29
Usage of databases
  • Annotation Searches - KW, Authors, Features.
  • What is the protein sequence for human insulin?
  • What does the 3D structure of calmodulin look
    like?
  • What is the genetic location of the cystic
    fibrosis gene?
  • List all introns in rat.
  • Homology Searches
  • Is there any protein sequence that is similar to
    mine?
  • Is this gene known in any other species?
  • Has someone already cloned this sequence?

30
Usage of databases
  • Pattern searches
  • Does my sequence contain any known motif (that
    can give me a clue about the function)?
  • Which known sequences contain this motif?
  • Is any part of my sequence recognised by a
    transcription factor?
  • List all known start, splice and stop signals in
    my genomic sequence
  • Prediction - Use the database as knowledge
    database
  • What may the structure of my protein be?
  • Secondary structure prediction
  • Modeling by homology
  • What is the gene structure of my genomic
    sequence?
  • Which parts of my protein have a high
    antigenicity?
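
A pattern search of the kind listed above ("Does my sequence contain any known motif?") can be sketched by translating a PROSITE-style pattern into a regular expression. This is a minimal sketch covering only the common syntax (`x`, `[...]`, `{...}`, repeat counts, `<` and `>` anchors); the function names and the test sequence are hypothetical. The pattern `N-{P}-[ST]-{P}` is PROSITE's N-glycosylation site pattern.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regex string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = repeat = ""
        # '<' anchors the pattern at the N-terminus, '>' at the C-terminus.
        if element.startswith("<"):
            prefix, element = "^", element[1:]
        if element.endswith(">"):
            suffix, element = "$", element[:-1]
        # Optional repeat count, e.g. x(2) or x(2,4).
        m = re.fullmatch(r"(.+?)\((\d+(?:,\d+)?)\)", element)
        if m:
            element, repeat = m.group(1), "{" + m.group(2) + "}"
        if element == "x":                      # any residue
            core = "."
        elif element.startswith("{") and element.endswith("}"):
            core = "[^" + element[1:-1] + "]"   # any residue except these
        else:
            core = element                      # single residue or [ABC]
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


def find_motif(sequence, pattern):
    """Return (1-based position, matched text) for all hits, overlaps included."""
    regex = prosite_to_regex(pattern)
    # The lookahead lets overlapping matches be reported.
    return [(m.start() + 1, m.group(1))
            for m in re.finditer(f"(?=({regex}))", sequence)]


hits = find_motif("MKNATLNPSANGSA", "N-{P}-[ST]-{P}")
print(hits)  # [(3, 'NATL'), (11, 'NGSA')]
```

A short pattern like this one matches by chance very often, so a hit is only a clue about function, as the slide says.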

31
Usage of Databases
  • Comparisons
  • Gene Families
  • Phylogenetic Trees

32
GenBank Growth Chart
(Figure: bases plotted against year)
33
Evolutionary basis of Alignment
  • Enable the researcher to determine if two
    sequences display sufficient similarity to
    justify the inference of homology.
  • Similarity is an observable quantity that may be
    expressed as say identity or some other measure.
  • Homology is a conclusion drawn from this data
    that the two genes share a common evolutionary
    history.
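
The distinction can be made concrete: similarity is something you compute, homology is something you infer from it. Below is a minimal sketch of one similarity measure, percent identity over two sequences that have already been aligned. The function name, the gap symbol and the example sequences are assumptions for illustration. Several definitions of percent identity exist; this one divides by the number of ungapped columns.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over the ungapped columns of a pairwise alignment."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue            # skip columns containing a gap
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


# Hypothetical aligned fragments: one gap column, one mismatch (I vs L).
identity = percent_identity("MKT-AYIAK", "MKTQAYLAK")
print(f"{identity:.1f}% identical")  # 87.5% identical
```

Whether a given identity value justifies inferring homology depends on alignment length and statistical significance, which this sketch does not assess.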

34
Sequence Formats
35
Fasta Format
>SANJAY REFORMAT of SANJAY.seq check 8826 from 1 to 573 March 12, 1998
MASSSVPPMITEEEARFEAEVSAVESWWRTDRFRLTRRPYSARDVVSLRGTLHHSYASDQ
MAKKLWRTLKSHQSAGTASRTFGALDPVQVTMMAKHLDTIYVSGWQCSSTHTATNEPGPD
LADYPYNTVPNKVEHLFFAQLYHDRKQHEARVSMTREQRAKTPYVDYLRPIIADGDTGFG
GATATVKLCKLFVERGAAGVHIEDQSSVTKKCGHMAGKVLVAVSEHINRLVAARLQFDVM
GVETVLVARTDAVAATLIQSNVDLRDHQFILGATNPDFKRRSLAAVLSAAMAAGKTGAVL
QAIEDDWLSRAGLMTFSDAVINGINRQLPEYEKQRRLNEWAAATEYSKCVSNEQGREIAE
RLGAGEIFWDWDIARTREGFYRFRGSVEAAVVRGRAFAPHADLIWMETSSPDLVECGKFA
QGMKASHPEIMLAYNLSPSFNWDAAGMTDEEMRDFIPRIAKMGFCWQFITLGGFHADALV
TDTFAREFAKQGMLAYVERIQREERNNGVDTLAHQKWSGANYYDRYLKTVQGGISSTAAM
GKGVTEEQFKEESRTGTRGLDRGGITVNAKSRL
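
Reading FASTA is simple because the format has only two kinds of line: a header starting with ">" and the sequence lines that follow it. The sketch below is a minimal parser under that assumption; the function name and the two demo records are hypothetical, and real-world files may need extra handling (comment lines, lowercase masking and so on).

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # ignore blank lines
        if line.startswith(">"):
            if header is not None:        # close the previous record
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)           # sequence may span many lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical two-record input for illustration.
example = """>demo1 hypothetical first record
MASSSVPPMI
TEEEARFEAE
>demo2 hypothetical second record
MSTKYSASAE
"""

records = parse_fasta(example)
for header, seq in records:
    print(header.split()[0], len(seq))
# demo1 20
# demo2 10
```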

36
GCG Format
ckl.seq  Length: 473  September 15, 1999 12:25  Type: P  Check: 8103  ..

  1  MSTKYSASAE SASSYRRTFG SGLGSSIFAG HGSSGSSGSS RLTSRVYEVT
 51  KSSASPHFSS HRASGSFGGG SVVRSYAGLG EKLDFNLADA INQDFLNTRT
101  NEKAELQHLN DRFASYIEKV RFLEQQNSAL TVEIERLRGR EPTRIAELYE
151  EEMRELRGQV EALTNQRSRV EIERDNLVDD LQKLKLRLQE EIHQKEEAEN
201  NLSAFRADVD AATLARLDLE RRIEGLHEEI AFLRKIHEEE IRELQNQMQE
251  SQVQIQMDMS KPDLTAALRD IRLQYEAIAA KNISEAEDWY KSKVSDLNQA
301  VNKNNEALRE AKQETMQFRH QLQSYTCEID SLKGTNESLR RQMSEDGGAA
351  GREAGGYQDT IARLEAEIAK MKDEMARHLR EYQDLLNVKM ALDVEIATYR
401  KLLEGEESRI SLPVQSFSSL SFRESSPEQH HHQQQQPQRS SEVHSKKTVL
451  IKTIETRDGE VVSESTQHQQ DVM
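
A GCG record differs from FASTA in two ways: the header line ends with a ".." divider, and the sequence lines carry position numbers and are split into blocks of ten. The sketch below recovers the plain sequence under those assumptions. The function names and the demo record are hypothetical, and the parser handles only a single-sequence record.

```python
import re


def parse_gcg(text):
    """Return (header_line, sequence) from a GCG-style single-sequence record."""
    header = None
    residues = []
    for line in text.splitlines():
        if header is None:
            # Everything up to the line ending in ".." is header or comment.
            if line.rstrip().endswith(".."):
                header = line.strip()
            continue
        # Drop position numbers and spacing, keep only residue letters.
        residues.append(re.sub(r"[^A-Za-z*]", "", line))
    if header is None:
        raise ValueError("no '..' divider found")
    return header, "".join(residues)


def declared_length(header):
    """Read the Length field from the header, if present."""
    m = re.search(r"Length:?\s*(\d+)", header)
    return int(m.group(1)) if m else None


# Hypothetical short record for illustration.
example = """demo.seq  Length: 23  Type: P  ..

   1  IKTIETRDGE VVSESTQHQQ DVM
"""

header, sequence = parse_gcg(example)
print(sequence)                                   # IKTIETRDGEVVSESTQHQQDVM
print(len(sequence) == declared_length(header))   # True
```

Comparing the parsed length with the declared Length field is a cheap sanity check; it does not replace verifying the Check value, which this sketch does not compute.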

37
Taxonomy Database
53
Blast Results
54
Examples of the New Biology
  • 1. Full genome-genome comparisons
  • 2. Rapid assessment of polymorphic genetic
    variations
  • 3. Complete construction of orthologous or
    paralogous groups of genes
  • 4. Structure determination of large
    macromolecular assemblies/complexes
  • 5. Dynamic simulation of realistic oligomeric
    systems
  • 6. Rapid structural/topological clustering of
    proteins
  • 7. Prediction of unknown molecular structures:
    protein folding
  • 8. Computer simulation of membrane structure and
    dynamic function
  • 9. Simulation of genetic networks and the
    sensitivity of these pathways to component
    stoichiometry and kinetics
  • 10. Integration of observations across scales of
    vastly different dimensions and organization to
    yield realistic environmental models for basic
    biology and societal needs

55
Theoretical?
  • The day will dawn when we will have sufficient
    information to understand how basic life
    functions are integrated into a living cell, and
    how such cells intercommunicate and interoperate
    to function as a living whole. Then maybe, we can
    start talking about theoretical biology

56
Categories of BioDbs - by domain of information
  • DNA
  • RNA
  • Protein
  • Genomic Mapping
  • Pathways
  • Structure
  • Bibliographic
  • Biochemical/Molecular/Miscellaneous

57
Other categories
  • By category of species
  • By families or superfamilies of molecules
  • etc
  • Demo
  • http://www.infobiogen.fr/services/dbcat/

58
Demonstration of BioDatabases
  • The majority of life science databases are online,
    accessible on the Web via the Internet
  • Catalogs of databases available
  • Need for a Registry to keep track and offer
    quality control