Title: Introduction to bioinformatics (I617)
1Introduction to bioinformatics (I617)
- Haixu Tang
- School of Informatics
- Email: hatang_at_indiana.edu
- Office: EIG 1008
- Tel: 812-856-1859
2Textbook
- A Primer of Genome Science (2nd Edition) by Greg Gibson and Spencer V. Muse, Sinauer Associates, 2004
- Suggested reading materials will be posted on the class wiki page: http://cheminfo.informatics.indiana.edu/djwild/I617_2006_wiki/index.php/Main_Page
- Office hours: MW 11:00-12:00, EIG 1008, or by appointment
3Grading
- Class project selected from one of the four covered areas (Bioinformatics, Chemical informatics, Laboratory informatics and Health informatics): 25%
- Suggested Bioinformatics topics will be posted on the class wiki page
- Homework in Bioinformatics: 25%
- 4 assignments, each 6.25%
4Bioinformatics: BIOlogy + informatics?
- Not really: it is a term (somewhat arbitrarily chosen) for a multi-disciplinary area that combines life sciences, physical sciences and computer science / informatics
- It addresses biological problems using theoretical informatics approaches, not vice versa
- It is transforming classical biology into an information science.
5The birth of bioinformatics
- A revolution in biology research: the emergence of Genome Science
- Technology advancement in both biology and information science
6Genome science: a revolution of biology
[Diagram relating Hypothesis and Data: hypothesis-driven approach vs. data-driven approach]
7Bioinformatics: from data analysis to data mining
[Diagram relating Hypothesis and Data. Labels: high-throughput data, low-throughput data, hypothesis generation, hypothesis confirmation / rejection]
8Bioinformatics in the driver's seat
[Diagram relating Hypothesis and Data. Labels: data mining, data analysis]
9Key technology advancements
- High throughput biotechnologies
- Genome sequencing techniques
- DNA microarray
- Mass spectrometry
- Large-scale experiments
- HGP, HapMap
- Omics / Systems Biology
- Massive data generation, storage, exchange and analysis
- CPU, storage, etc.
- High speed network (Internet)
- Bioinformatics
10Bioinformatics: mutually beneficial
- For biologists
- Fragment assembly in genome sequencing
- Genome comparison
- Gene clustering in DNA microarray analysis
- Protein identification in proteomics
- For computer scientists
- String algorithms / Tree algorithms
- Alternative Eulerian path (BEST theorem)
- Reversal distances
- Probabilistic graphical models (HMMs, BNs, etc.)
11Two origins of bioinformatics
- Combinatorial pattern matching in theoretical computer science
- DNA and protein sequence analysis
- Physical and analytical chemistry of biomolecules
- Protein structure analysis -> Structural bioinformatics
- Bio-analytical chemistry -> Proteomics
13Bioinformatics addresses computational challenges
in life and medical sciences
- New computational problems for automatic data analysis
- Genome sequencing
- Proteomics
- Transcriptomics
- Data representation and visualization
- Genome Browser
- Solving biological problems by in silico approaches
- Reformulation of old problems using new high-throughput data
- Gene finding
- Protein structure and function
- Formulating new problems using high-throughput data
- Comparative genomics
- Polymorphisms / Population genetics
- Systems Biology
14Bioinformatics resources
- Databases
- Nucleic Acids Research (NAR) annual database issue
- Organization
- ISCB (International Society for Computational Biology)
- Conferences
- ISMB
- RECOMB
- Many other smaller or regional conferences, e.g. ECCB, CSB, PSB, etc., including the local Indiana Bioinformatics conference
15A case study
- How does bioinformatics help and transform classical biological topics?
- Molecular evolutionary studies: from anatomical features to molecular evidence
- Genome evolution: comparison of gene orders
17Early Evolutionary Studies
- Anatomical features were the dominant criteria used to derive evolutionary relationships between species from Darwin's time until the early 1960s
- The evolutionary relationships derived from these relatively subjective observations were often inconclusive. Some of them were later proved incorrect.
18Evolution and DNA Analysis: the Giant Panda Riddle
- For roughly 100 years scientists were unable to figure out which family the giant panda belongs to
- Giant pandas look like bears but have features that are unusual for bears and typical for raccoons, e.g., they do not hibernate
19Evolution and DNA Analysis: the Giant Panda Riddle
- In 1985, Stephen O'Brien and colleagues solved the giant panda classification problem using DNA sequences and bioinformatics algorithms
20Evolutionary Tree of Bears and Raccoons
21Evolutionary Trees: DNA-based Approach
- 40 years ago, Emile Zuckerkandl and Linus Pauling brought the reconstruction of evolutionary relationships from DNA into the spotlight
- In the first few years after Zuckerkandl and Pauling proposed using DNA for evolutionary studies, the possibility of reconstructing evolutionary trees by DNA analysis was hotly debated
- Now it is a dominant approach to studying evolution.
23Evolutionary Trees
- How are these trees built from DNA sequences?
- leaves represent existing species
- internal vertices represent ancestors
- root represents the common evolutionary ancestor
24Rooted and Unrooted Trees
In an unrooted tree, the position of the root (the common ancestor) is unknown. Otherwise, it is like a rooted tree.
25Distances in Trees
- Edges may have weights reflecting:
- Number of mutations on the evolutionary path from one species to another
- Time estimate for the evolution of one species into another
- In a tree T, we often compute:
- $d_{ij}(T)$: the length of the path between leaves i and j
- $d_{ij}(T)$: tree distance between i and j
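The tree distance defined on this slide is the sum of the edge weights along the unique path between two leaves. A minimal Python sketch, assuming the tree is stored as a weighted adjacency map; the tree, its node names and its weights are hypothetical and are not taken from the course figures:

```python
from collections import defaultdict


def build_tree(edges):
    """Build an undirected weighted adjacency map from (u, v, weight) triples."""
    adj = defaultdict(dict)
    for u, v, w in edges:
        adj[u][v] = w
        adj[v][u] = w
    return adj


def tree_distance(adj, i, j):
    """Return d_ij(T): the sum of edge weights on the unique path from i to j."""
    stack = [(i, None, 0)]  # (current node, node we came from, distance so far)
    while stack:
        node, parent, dist = stack.pop()
        if node == j:
            return dist
        for nxt, w in adj[node].items():
            if nxt != parent:  # a tree has no cycles, so only avoid stepping back
                stack.append((nxt, node, dist + w))
    raise ValueError(f"no path between {i!r} and {j!r}")


# Hypothetical tree: leaves "1".."4", internal vertices "a" and "b".
edges = [("1", "a", 2), ("2", "a", 3), ("a", "b", 4), ("3", "b", 1), ("4", "b", 5)]
tree = build_tree(edges)
print(tree_distance(tree, "1", "4"))  # path 1-a-b-4: 2 + 4 + 5
```

Because a tree has exactly one path between any two vertices, a plain depth-first walk is enough; no shortest-path algorithm is needed.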
26Distance in Trees: an Example
$d_{1,4} = 12 + 13 + 14 + 17 + 12 = 68$
27Distance Matrix
- Given n species, we can compute the n x n distance matrix $D_{ij}$
- $D_{ij}$ may be defined as the edit distance between a gene in species i and species j, where the gene of interest is sequenced for all n species.
- $D_{ij}$: edit distance between i and j
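One way to make the distance matrix concrete is the standard dynamic-programming edit distance (insertions, deletions and substitutions, each with cost 1). The sketch below assumes unit costs, which the slide does not specify, and the three sequences are invented toy data rather than real genes:

```python
def edit_distance(s, t):
    """Levenshtein distance between strings s and t with unit costs."""
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, a in enumerate(s, 1):
        curr = [i]  # distance from s[:i] to the empty prefix of t
        for j, b in enumerate(t, 1):
            curr.append(min(
                prev[j] + 1,             # delete a from s
                curr[j - 1] + 1,         # insert b into s
                prev[j - 1] + (a != b),  # match or substitute
            ))
        prev = curr
    return prev[-1]


def distance_matrix(seqs):
    """n x n matrix D with D[i][j] = edit distance between seqs[i] and seqs[j]."""
    n = len(seqs)
    return [[edit_distance(seqs[i], seqs[j]) for j in range(n)] for i in range(n)]


# Hypothetical toy "genes" for three species.
genes = ["ACGT", "ACGA", "TCGA"]
for row in distance_matrix(genes):
    print(row)
```

The resulting matrix is symmetric with zeros on the diagonal, which is what a tree-fitting algorithm expects as input.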
28Fitting Distance Matrix
- Given n species, we can compute the n x n distance matrix $D_{ij}$
- Evolution of these genes is described by a tree that we don't know.
- We need an algorithm to construct a tree that best fits the distance matrix $D_{ij}$
29Reconstructing a 3 Leaved Tree
- Tree reconstruction for any 3x3 matrix is straightforward
- We have 3 leaves i, j, k and a center vertex c
Observe: $d_{ic} + d_{jc} = D_{ij}$, $d_{ic} + d_{kc} = D_{ik}$, $d_{jc} + d_{kc} = D_{jk}$
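The three equations on this slide say that each pairwise distance is the sum of the two leaf-to-center edge lengths. Adding the first two and subtracting the third gives $d_{ic} = (D_{ij} + D_{ik} - D_{jk}) / 2$, and similarly for the other two edges. A small Python sketch of that closed form; the function name and the example distances are made up for illustration:

```python
def three_leaf_tree(D_ij, D_ik, D_jk):
    """Edge lengths (d_ic, d_jc, d_kc) of the star tree fitting three distances.

    Solves d_ic + d_jc = D_ij, d_ic + d_kc = D_ik, d_jc + d_kc = D_jk.
    """
    d_ic = (D_ij + D_ik - D_jk) / 2
    d_jc = (D_ij + D_jk - D_ik) / 2
    d_kc = (D_ik + D_jk - D_ij) / 2
    return d_ic, d_jc, d_kc


# Hypothetical distances between three species.
d_ic, d_jc, d_kc = three_leaf_tree(3, 4, 5)
print(d_ic, d_jc, d_kc)
```

The edge lengths are non-negative exactly when the three distances satisfy the triangle inequality, so any metric 3x3 matrix is fitted by such a tree.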
30Turnip vs Cabbage: Look and Taste Different
- Although cabbages and turnips share a recent common ancestor, they look and taste different
31Turnip vs Cabbage: Comparing Gene Sequences Yields No Evolutionary Information
32Turnip vs Cabbage: Almost Identical mtDNA Gene Sequences
- In the 1980s Jeffrey Palmer studied the evolution of plant organelles by comparing the mitochondrial genomes of the cabbage and the turnip
- 99% similarity between genes
- These surprisingly similar gene sequences differed in gene order
- This study helped pave the way to analyzing genome rearrangements in molecular evolution
33Turnip vs Cabbage: Different mtDNA Gene Order
[Figure labels: Before, After]
Evolution is manifested as the divergence in gene order
38Transforming Cabbage into Turnip
Reversal distance
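The reversal distance between two gene orders is the minimum number of segment reversals needed to turn one into the other. The sketch below is a brute-force breadth-first search over unsigned permutations. It is an illustrative assumption, not the method used in the studies cited here; it ignores gene orientation and is practical only for a handful of genes, since the number of orders grows factorially:

```python
from collections import deque


def reversal_distance(start, target):
    """Minimum number of segment reversals turning start into target (unsigned)."""
    start, target = tuple(start), tuple(target)
    if sorted(start) != sorted(target):
        raise ValueError("both gene orders must contain the same genes")
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        perm, dist = queue.popleft()
        if perm == target:
            return dist
        n = len(perm)
        for i in range(n):
            for j in range(i + 2, n + 1):  # reverse segments of length >= 2
                nxt = perm[:i] + perm[i:j][::-1] + perm[j:]
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, dist + 1))
    raise ValueError("target not reachable")  # unreachable for valid input


# Hypothetical gene orders, not the actual cabbage and turnip mtDNA.
print(reversal_distance([3, 1, 2], [1, 2, 3]))
```

Breadth-first search explores gene orders in increasing number of reversals, so the first time the target is reached the count is minimal.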
39History of Chromosome X
Rat Consortium, Nature, 2004