Title: The BIG Goal
1The BIG Goal
"The greatest challenge, however, is analytical. Deeper biological insight is likely to emerge from examining datasets with scores of samples."
Eric Lander, "Array of hope", Nat. Genet., volume 21 supplement, pp. 3-4, 1999.
Bio-informatics: provide methodologies for elucidating biological knowledge from biological data.
2Central Paradigm of Bio-informatics
Genetic Information
3Central Paradigm of Bio-informatics
Molecular Structure
Genetic Information
4Central Paradigm of Bio-informatics
Biochemical Function
Molecular Structure
Genetic Information
5Central Paradigm of Bio-informatics
Biochemical Function
Molecular Structure
Genetic Information
Symptoms
6Central Paradigm of Bio-informatics
Biochemical Function
Molecular Structure
Genetic Information
Symptoms
7Computer Science Tools are Crucial
http://www.sanger.ac.uk/PostGenomics/S_pombe/presentations/EMBOCopenhagenWebsite.pdf
8Computer Science Tools are Crucial
- New bio-technologies create huge amounts of data.
- It is impossible to analyze data by manual inspection.
- Novel mathematical, statistical, algorithmic and computational tools are necessary!
9Automated Sequencing
http://cbms.st-and.ac.uk/academics/ryan/Teaching/SBBioinf/lecture1.htm
10What is Bio-Informatics ?
- A field of science in which Biology, Computer Science and Information Technology merge into a single discipline.
- Computers (software tools) are used to collect, analyze and interpret biological information at the molecular level.
- Goal: To enable the discovery of new biological insights and create a global perspective for biologists.
11Disciplines
- Development of new algorithms and statistical methods to assess relationships among members of large data sets.
- Analysis and interpretation of various types of data.
- Development and implementation of tools to efficiently access and manage different types of information.
12Why Use Bio-Informatics ?
- An explosive growth in the amount of biological information necessitates the use of computers for cataloging and retrieval of data (> 3 billion bp, > 30,000 genes).
- The human genome project.
- Automated sequencing.
- GenBank has over 16 billion bases and is doubling every year!
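The doubling claim can be turned into a quick back-of-the-envelope projection. The sketch below assumes the growth stays a clean exponential with a fixed doubling time, which is an idealization and not something the slide asserts. The function name `projected_bases` is hypothetical, chosen only for illustration.

```python
def projected_bases(start_bases: float, years: float,
                    doubling_time_years: float = 1.0) -> float:
    """Project database size assuming steady exponential doubling.

    Assumption: growth continues at a constant doubling time.
    """
    return start_bases * 2 ** (years / doubling_time_years)


if __name__ == "__main__":
    start = 16e9  # "over 16 billion bases", the figure quoted on the slide
    for year in range(6):
        size = projected_bases(start, year)
        print(f"year +{year}: {size:.3g} bases")
```

Under this assumption the database grows eightfold in three years, which is why manual cataloging and retrieval stops being an option.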
13New Types of Biological Data
- Micro arrays: gene expression.
- Multi-level maps: genetic, physical, sequence, annotation.
- Networks of protein-protein interactions.
- Cross-species relationships: homologous genes, chromosome organization.
http://www.the-scientist.com/yr2002/apr/research?020415.html
14Why Bio Informatics ? (cont.)
- A more global view of experimental design (from the "one scientist, one gene/protein/disease" paradigm to whole-organism consideration).
- Data mining: functional/structural information is important for studying the molecular basis of diseases, diagnostics, developing drugs (personalized medicine), evolutionary patterns, etc.
15Why Bio Informatics ? (cont.)
http://www.library.csi.cuny.edu/davis/Bioinfo_326/lectures/lect14/lect_14.html
16Future of Genomic Research
- Principal milestones in data mining and genome analysis:
- Sanger method for sequencing, invented in 1977 (awarded the Nobel Prize in 1980).
- Polymerase chain reaction (PCR), invented in 1983 (awarded the Nobel Prize in 1993).
http://www.usgenomics.com/technology/index.shtml
17The next step: Locate all the genes and understand their function.
This will probably take another 15-20 years!
18Disease Genes Discovered
19(No Transcript)
20The job of biologists is changing
One can efficiently find information using databases and software on the web.
Question: How likely are you to use a free bio-informatics library of accessible software?
http://www.cryst.bbk.ac.uk/classlib/BBSRC_poster/potential.html
21Molecular Biology Analysis Software Tools
- Freely available on the Web.
- Highlights
22Broad Classification of Biological Databases
http://www.mrc-lmb.cam.ac.uk/genomes/madanm/pres/biodb.htm
23NCBI
ENTREZ - PubMed
24http://www3.ncbi.nlm.nih.gov/Entrez/index.html
25Post-genomic terms (Oct. 2002)
Term            Google search   PubMed
Genome          2.1x10^6        76,566
Proteome        89,300          1,701
Transcriptome   9,960           229
Gene function   1.2x10^6        6.5x10^5
Metabolome      1,170           29
Glycome         138             6
From Computational Proteomics, Mark B Gerstein, Yale U.
26http://cbms.st-and.ac.uk/academics/ryan/Teaching/SBBioinf/lecture1.htm
27http://cbms.st-and.ac.uk/academics/ryan/Teaching/SBBioinf/lecture1.htm
28http://cbms.st-and.ac.uk/academics/ryan/Teaching/SBBioinf/lecture1.htm
29http://cbms.st-and.ac.uk/academics/ryan/Teaching/SBBioinf/lecture1.htm
30Similarity / Analogy
Examples: If it looks like an elephant and smells like an elephant, it's an elephant. If it walks like a duck and quacks like a duck, it's a duck.
http://cbms.st-and.ac.uk/academics/ryan/Teaching/molbiol/Bioinf_files/v3_document.htm
31Similarity Search in Databanks
Find sequences similar to a working draft. As databanks grow, homologies get harder to detect, and quality is reduced.
Alignment tools: BLAST, FASTA (time-saving heuristics, i.e. approximations).
>gb|BE588357.1|BE588357 194087 BARC 5BOV Bos taurus cDNA 5'.
Length = 369
Score = 272 bits (137), Expect = 4e-71
Identities = 258/297 (86%), Gaps = 1/297 (0%)
Strand = Plus / Plus
Query  17 aggatccaacgtcgctccagctgctcttgacgactccacagataccccgaagccatggca 76
Sbjct   1 aggatccaacgtcgctgcggctacccttaaccact-cgcagaccccccgcagccatggcc 59
Query  77 agcaagggcttgcaggacctgaagcaacaggtggaggggaccgcccaggaagccgtgtca 136
Sbjct  60 agcaagggcttgcaggacctgaagaagcaagtggagggggcggcccaggaagcggtgaca 119
Query 137 gcggccggagcggcagctcagcaagtggtggaccaggccacagaggcggggcagaaagcc 196
Sbjct 120 tcggccggaacagcggttcagcaagtggtggatcaggccacagaagcagggcagaaagcc 179
Query 197 atggaccagctggccaagaccacccaggaaaccatcgacaagactgctaaccaggcctct 256
Sbjct 180 atggaccaggttgccaagactacccaggaaaccatcgaccagactgctaaccaggcctct 239
Query 257 gacaccttctctgggattgggaaaaaattcggcctcctgaaatgacagcagggagac 313
Sbjct 240 gagactttctcgggttttgggaaaaaacttggcctcctgaaatgacagaagggagac 296
Pairwise alignment
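To make the idea of a pairwise alignment concrete, here is a minimal sketch of exact global alignment by dynamic programming (the Needleman-Wunsch algorithm). This is not what BLAST or FASTA do internally: those tools use faster heuristics, while this sketch fills the full scoring table. The scoring values (match +1, mismatch -1, gap -2) and the function names are assumptions chosen for illustration.

```python
def needleman_wunsch(a: str, b: str, match: int = 1,
                     mismatch: int = -1, gap: int = -2):
    """Return (score, aligned_a, aligned_b) for a global alignment."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,   # (mis)match
                              score[i - 1][j] + gap,       # gap in b
                              score[i][j - 1] + gap)       # gap in a
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Identical, non-gap columns as a percentage of alignment length."""
    same = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-")
    return 100.0 * same / len(aligned_a)


if __name__ == "__main__":
    # First 21 bases of the Query and Sbjct rows shown on the slide.
    s, qa, sa = needleman_wunsch("aggatccaacgtcgctccagc", "aggatccaacgtcgctgcggc")
    print(qa)
    print(sa)
    print(f"score={s}, identity={percent_identity(qa, sa):.0f}%")
```

The table costs time and memory proportional to the product of the two lengths, which is the practical reason databank searches rely on heuristics instead.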
32Multiple Sequence Alignment
Multiple alignment: find protein families and functional domains.
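Once several sequences are aligned, conserved columns hint at functional domains. The sketch below assumes the alignment has already been computed by some other tool and only summarizes it column by column. The toy sequences and the name `column_profile` are hypothetical, not taken from the slides.

```python
from collections import Counter


def column_profile(alignment):
    """For each column, return (most common symbol, its frequency).

    Assumption: all rows are pre-aligned and have equal length;
    '-' marks a gap and is counted like any other symbol.
    """
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for column in zip(*alignment):
        symbol, count = Counter(column).most_common(1)[0]
        profile.append((symbol, count / len(column)))
    return profile


if __name__ == "__main__":
    toy = ["MKT-LLV",
           "MKTALLV",
           "MRTALIV"]  # hypothetical aligned protein fragments
    profile = column_profile(toy)
    print("consensus:", "".join(symbol for symbol, _ in profile))
    print("fully conserved columns:",
          [i for i, (_, freq) in enumerate(profile) if freq == 1.0])
```

Runs of fully conserved columns across a protein family are the kind of signal that points to a shared domain.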
33Structure - Function Relationships
structure
sequence
function
34Protein Structure (domains)
35Phylogeny
Evolution: a process in which small changes occur within species over time. These changes can be monitored today using molecular techniques.
- The Tree of Life
- A classical, basic science problem, since Darwin's 1859 Origin of Species.
36Searching Protein Sequence Databases - How far back can we see?
Tree of Life
Mammalian radiation
Invertebrates/ vertebrates
Plant/ animals
Prokaryotes/ eukaryotes
First self replicating systems
Formation of the solar system
Origin of the universe ?
37The Human Genome Project (HGP)
- Write down all of human DNA on a single CD (working draft published 2001).
- Identify all genes, their location and function (far from completion).
38Example for Gene Localization Bio-Tool (FISH).
39FISH - Fluorescence In-Situ Hybridization.
- Fluorescently labeled probes hybridize to specific chromosomal locations.
- Example application: low-resolution localization of a gene.
40Sequencing Genes Gene Assembly
Automated sequencing
41Gene Finding
- Only 2-3% of the human genome encodes functional genes.
- Genes are found along large non-coding DNA regions.
- Repeats, pseudo-genes, introns and vector contamination make gene finding confusing.
42Gene Finding - cont.
- Find special gene patterns:
- Translation start and stop sites (open reading frames, ORFs).
- Transcription factors, promoters.
- Intron splice sites.
- Etc.
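The first pattern in the list, translation start and stop sites, can be sketched as a simple open reading frame scan. This is a minimal illustration under stated assumptions: it scans only the three forward frames, uses the standard stop codons (TAA, TAG, TGA), and ignores introns, so it is closer to prokaryotic ORF finding than to a real gene finder for human DNA. The name `find_orfs` and the `min_codons` parameter are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 3):
    """Return (start, end, frame) for each forward-strand ORF.

    start is the index of the ATG, end is exclusive and includes the
    stop codon. min_codons counts codons before the stop (ATG included).
    Assumption: reverse strand and splicing are not considered.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos                      # first start codon in this frame
            elif start is not None and codon in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3, frame))
                start = None                     # look for the next ORF
    return orfs


if __name__ == "__main__":
    seq = "CCATGAAATTTGGGTAACC"  # toy sequence, not real data
    for start, end, frame in find_orfs(seq):
        print(frame, start, end, seq[start:end])
```

In a genome where only a few percent of the sequence is coding, a scan like this produces many false candidates, which is why promoter and splice-site signals are combined with it.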
43(No Transcript)
44Micro Arrays (DNA Chips)
New biotechnology breakthrough: measure RNA expression levels of thousands of genes in one experiment.
45The Idea Behind Micro Arrays
46Clustering Analysis of Gene Expression Data
DNA chips and personalized medicine (leading
edge, future technologies).
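Clustering groups genes whose expression rises and falls together across experiments. The sketch below is one common way to do it, assumed here for illustration rather than taken from the slides: average-linkage agglomerative clustering with distance 1 minus the Pearson correlation. The gene names and expression values are made up, and the function names are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length profiles.

    Assumption: neither profile is constant (non-zero variance).
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def cluster_genes(profiles, k):
    """Merge the two closest clusters until only k clusters remain."""
    clusters = [[name] for name in profiles]

    def distance(c1, c2):
        # Average linkage: mean of (1 - r) over all cross-cluster gene pairs.
        pairs = [(a, b) for a in c1 for b in c2]
        return sum(1 - pearson(profiles[a], profiles[b])
                   for a, b in pairs) / len(pairs)

    while len(clusters) > k:
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: distance(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


if __name__ == "__main__":
    toy = {  # hypothetical expression levels over four conditions
        "geneA": [1.0, 2.0, 3.0, 4.0],
        "geneB": [2.0, 4.0, 6.0, 8.5],
        "geneC": [4.0, 3.0, 2.0, 1.0],
        "geneD": [8.0, 6.0, 4.5, 2.0],
    }
    print(cluster_genes(toy, k=2))
```

Correlation distance ignores absolute expression level and compares only the shape of each profile, which suits array data where genes differ widely in baseline signal.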
47Pharmaco-genomics
- Use DNA information to measure and predict the reaction to drugs.
- Personalized medicine.
- Faster clinical trials: selected populations.
- Fewer drug side-effects.
48Protein and Other Arrays
Sequencing the human genome => finite problem.
Studying the proteome => endless possible variations, dynamic.
Protein array
Future fields of study:
- Proteins: Genomics, Proteomics
- Lipids: Genomics, Lipomics
- Sugars: Genomics, Glycomics
49Understanding Mechanisms of Disease
EC number
compound
50Putting it all together Bio-Informatics
ORTHOLOG GENES (Taxonomy)
SEQUENCE ALIGNMENT
CODING REGIONS
CONSERVED DOMAINS
SEQUENCES LITERATURE
3-D STRUCTURE
GENE FAMILIES
SIGNAL PEPTIDE
MUTATIONS POLYMORPHISM
GENOME MAPS
CELLULAR LOCATION
51Putting it all together Bio-Informatics
ORTHOLOG GENES (Taxonomy)
SEQUENCE ALIGNMENT
CODING REGIONS
CONSERVED DOMAINS
GENE EXPRESSION, GENES FUNCTION, DRUG PERSONAL
THERAPY
3-D STRUCTURE
GENE FAMILIES
SIGNAL PEPTIDE
MUTATIONS POLYMORPHISM
GENOME MAPS
CELLULAR LOCATION