Title: MSRI Summer Graduate Workshop
1 Algebraic Statistics for Computational Biology
- MSRI Summer Graduate Workshop
2 What is Biology? The study of living organisms.
What is Statistics? The science concerned with the collection, organization, analysis and interpretation of data.
What is Algebra? The part of mathematics that deals with generalized arithmetic.
3 What is Algebraic Statistics?
4 What is Algebraic Statistics?
There is no dictionary definition yet. The term was coined by European statisticians interested in applying Gröbner bases to the design of experiments.
Their book is G. Pistone, E. Riccomagno and H. Wynn, Algebraic Statistics: Computational Algebra in Statistics. CRC Press, 2000.
5 Table of Contents
- Part I - Introduction to the four themes
- 1. Statistics
- 2. Computation
- 3. Algebra
- 4. Biology
- Part II - Studies on the four themes
- 5. Parametric Inference
- 6. Polytope Propagation on Graphs
- 7. Parametric Sequence Alignment
- 8. Bounds for Optimal Sequence Alignment
- 9. Inference Functions
- 10. Geometry of Markov Chains
- 11. Equations Defining Hidden Markov Models
- 12. The EM Algorithm for Hidden Markov Models
New book: Algebraic Statistics for Computational Biology, edited by Lior Pachter and Bernd Sturmfels, Cambridge University Press, 2005.
6 Algebraic Statistics for Computational Biology Group, Department of Mathematics, U.C. Berkeley
http://math.berkeley.edu/lpachter/ascb/
Photo courtesy of Robert Fisher, Lawrence Hall of Science, March 7th, 2005
7 Who is this girl?
8 The human genome
Consists of 2.8 billion DNA bases.
Sequenced in 2001 and finished in 2004.
Contains genes:
- these are subsequences which code for protein.
- estimated number of genes: 20,000-25,000.
- genes make up less than 5% of the genome.
Example: Breast-ovarian cancer susceptibility gene (BRCA1)
9 The human genome
10 The human genome
11
>hg17_dna range=chr17:38464686-38473085 5'pad=0 3'pad=0 revComp=FALSE strand=? repeatMasking=none
ATCCAGAAGTCTAGTATACATCTCAAAATTCATGCATCTGGCCGGGCACAG
TGGCTCACACCTGCAATCCCAGCACTTTGGGAGGCCGAGGTGGGTGGATT
ACCTGAGGTCAGGAGTTTAAGACCAGCCTGGCCAACATGGTAAAACCCCA
TCTCTACTAAAAATACAAGTATTAGCCAGGCATTGTGGCAGGTGCCTGTA
ATCCCAGCTACTCGGGAGGCTGAGGCAGGAAAATCACTTGAACCGGGAGG
CGGAGGTTGGAGTGAGCTGAGATCGTGCTACCGCACTCCATGCACTCTAG
CCTGGGCAACAGAACGAGATGCTGTCACAACAACAACAACAACAACAACA
ACAACAACAACAACAACAACAAATTCTCACATCTAAAACAGAGTTCCTGG
TTCCATTCCTGCTTCCTGCCTTTCCCACTCCCCCATATTCCCTACCATGC
CTTCTTCATCTAATTTAATATTACTAACAAGATCTATTGTTCAAGCCAAA
ACCCAAGTGTCACTCCTTCAATTTCTCTTTACCTTATCCTCCAAATTTAA
TCCATTAGCAAGTCCTCTCTTCAAACCCATCCCAAACCAACCTTGTTTTT
AACCATCTCCACACCACCAATTACCACAAGGATAAAATCTGAATTCCTTA
CCACCAAATACTATGTGATCTGGCCCTCATCTATGACCTTCTCCCATTCC
TTGTGTAATCTCTGCCTCCACACATAATTTGCAAATTACTCCAGCTACAC
TGGCCTATTATTATTATTATTATTATTTTTGAGACGGAGTCTTGCTCTTT
CGCCCAGCCTGGAGTGCAGTGGCGCAATCTCAGCTCACTGCAATCTCCGC
CTCCTGGGTTCAAGCGATTCTCCTGCCCCAGCCTCCCAAGTAGCTGTGAT
TACAGGCACATGCCACCATTCCCAGCTAATTTTTTTTTGTTTTTGAGATG
GAGTTTCACTCTTGTTGCCCAGGCTGGAGTGCAATGGTGCGATCTCAGCT
CACCACAACCTCCACCTCCCGGGTTGATGAAGTGATTCTCTTGTCTCAGC
CTCCCGTGTAGCTGGGATTAGAGGCACGCGCCACCACGCTGGGCAAATTT
TTGTATTTTTAGTAGAGACAGGGTTTCTACCTCAGTGATCTGTCCGCCTT
GACCTCCCAAAGTGCTGGGATTACAGGAATGAGCCACCACACCCAGCCGT
GCCCAGCTAATTTTTGCATTTTTTAGTAGAGATGGGGTTTTGCCACGTTG
GCCAGGCTGGTCTCAAACTCCTGACCTCAGGGGATCTGCCTGCCTCGGCC
TCCTAGAGTGCTGGAATTACAGGTGTGAGCCACTGTGCCCGAACCTTTTA
TCATTATTATTTCTTGAGACAGGAGTCTTGCTCTGTCGTTCAGGCTGGAG
TGCAGTGATGCGATCTTGGCTCACTGTAACTCCTACCTTTCGGTTCAAGT
GATTCTCCTGCCTCAGCCTCTGGAGTAGCTGGGATTACAGGCACTGGGAT
TACAGGCACACACCACCACACCATGCTAGTTTTTTGTATTTTTAGTAGAG
ATGGGGTTTCACCATGTTGGCCAGGCTGGTCTCGAACTCCTGACCTCAAG
TGATTTGCCTGCCTTGGCTTCCCAAAGTGCTGGGATTATAGGCACGAGCC
ACCACACACGACCAACATTGGCCTATCTTTTAAAAAATAAACCAAGCTCT
GGCCGGGCACAGTGGCTCACACCTGTGATCCCAGCACTTTGGGAGGTTGA
GGTGGTTGGATCACTTGAGTTCAGGAGTTTGAGACCAGCCTGACCAACGT
GGTAAAACCCCATCTCTACTAAAAATAAAAACTAGTCGGGTGTGGTAGCA
CGCGTGCCTGTAATACCAGCTACTCAGGAGGCCAAGGCAGGAGAATTGCT
TGAACCCAGGAGACAGAGTTTGCAGTGAGCCAAGATTGTGCCACTGCACT
CCAGCCTGGGGGATAGAGGGAGACACCATCTCAAAAAAACCAAAATACAG
AAATCAAAAAACCACACTCATTATTACCTCAAGACCTTTATGTTTGCTAT
TCCTCTGCCTATAAGATGCATTCCCTTCATTTTTCAAGGACAATTATTTC
TTGTTATTTAGGTCTCAGCTCAATTTTTTCAGAAAGGCTTTCCCTGGCCT
CCTTAAACGAAAGTAATCAACAACCTTTGACAGCTAATACTATTCCACTG
TTCTGTATATTTCTCCATAGCATTTATTGTTATCTTAAATTCATCTTTAT
TGTGTATCTCCCCTCGACAGAACCTGAATCCTACCAGGGACTTAGTTAGT
CTTATTTACTGTTGCATTCCTAGTGCCCAGAACACAGTAGGCTCCCAATA
AATAGCCACTGAATAAAAGTTAAAACCAACAAAAATAATCATTTAATTAA
TTATGAATACATCGAATTGTGCACAATAGTTTATAAAATTACTTTTTTTT
TTTTTTTAAGACAGGGTCTCATTCTGTCTCACAGGCTGGAGTGCAGTGGT
GCAATCTAGGCTCACTGCAACCTCCGCCTCCCGGGTTCAAGTGATTCTCC
TGCCTCAGCCTCCCCAGCAGCTAGGATTACAGGCACATGCCACCACGCTC
GACTAATTTTTTTGTGTTTTTAGTAGAGACAAGGTTTCACCATGTTGACC
AGGCTGGTCTCGAACTCCTGACCTCAAGTGATCCACCTGCCTTGGCCACT
CAAAGTGCTGGGATTATAGGCATGAGCCACCACGCCTGGCCTATAAAATT
ACTTTCACATTTCATTTTGCCTGATCTGTTGTCACAGAAGTTCTCAGATG
GCTGTTCTGAAATTATTCCTCCTCCTACACTCTATCTTATTTACTTCTCA
CTGTTCTCAGTATCATAAAGTGCAACATCTTTTTGAAGCAATCTGAATTA
TAAACAGATACATTTGCATGTATATATATGTATATATGCATATGCACACA
CACACTTTTTTTTTTTTAAGAGACAGGGTCTTGCTCTGTGCAAGTGCAAG
AGTGCAATGGTATGATCATAGCTCACTGCAGCCTTGAACTCCTGGGCTCA
AGTGATTCTTCTGGCTTAGCTTCCTCAGTAGCTAAGACTACAGAAGCACA
CTGCCATGCCCGGCTAATTAAAAAAAAATTTTGTGGAGACAGAGTCTCAC
TATGTTGCCCAGGCTGGTTTCAAACTCCTGGCCTCAAGTAATCTTCCTGT
CTCAGCCTCCCAAAGGGCTGAGATTATAAGTGTGAGCCACTGCATCTGGA
CTGCATATTAATATGAAGAGCTTTTCTTCAACAACAGTGAACAGTTTTCT
ACAAAGGTATATGCAAGTGGGCCCACTTCTTGTTCTTATGAATCTTTTCT
TTCCTTTTATAAAACTCCTTTTCCTTTCTCTTTTCCCCAAAGAAAGGACT
GTTTCTTTTGAAATCTAGAACAAATGAGAACAGAGGATATCCTGGTTTGC
GCTGCAAAATTTTTTTTTTTTTTAAGACGGAGTCTCGCTCTGTTGCCAGG
TTGGAGTGCAGTGGCACGATCTTGGCTCATTGCAACCTCCACCTCCCGGG
TTCAAGAGATTCTCCTGCCTCAGCCTCCTGAGTAGCTGGAACTAAAGGCG
CATGCCACCACGCTGAGTAATTTTTTGTATTTTAGTAGAGACAGGGTTTC
ACCATGTTGCCCAGGCTGATCTCGAACTCCTGAGCTCAGGCAATCTGCCT
GTCTTGGCCTCCCACAGTGTTAGGATTACAGGCATGAGCCACTGCACCCG
ATTTTTTTTTTCTTTTGATGGAGTTTTGCTCTTGTTGCCCAGGTTAGAGT
GCAATGATGCGATCTCAGCTCACTGCAACCCCCGCCTCCCAGGTTCAAGT
GATTCTCCTGCCTCAGCCTCCCGAGTAGCTGGAATTACAGGCAAGTGCCA
CCAAGCCCGGCTAATTTTGTATTTTTAGTAGAAACGGGGTTTCTCCATGT
TGGTCAGGCTGGTCTTGAACTCCCGACATCAGGTGATCCAAGCGCCTCAG
CCTCCCAAAGCGCTGGGATTATAGGTATGAGCCACAGTGCAGGCCTGCAT
AATTCTTGATGATCCTCATTATCATGGAAAATTTGTGCATTGTTAAGGAA
AGTGGTGCATTGATGGAAGGAAGCAAATACATTTTTAACTATATGACTGA
ATGAATATCTCTGGTTAGTTTGTAACATCAAGTACTTACCTCATTCAGCA
TTTTTCTTTCTTTAATAGACTGGGTCACCCCTAAAGAGATCATAGAAAAG
ACAGGTTACATACAGCAGAAGAACGTGCTCTTTTCACGGAGATAGAGAGG
TCAGCGATTCACAAAAGAGCACAGGAAGAATGACAGAGGAGAGGTCCTTC
CCTCTAAAGCCACAGCCCTTTAATAAGGCTTGTAGCAGCAGTTTCCTTCT
GGAGACAGAGTTGATGTTTAATTTAAACATTATAAGTTTGCCTGCTGCAC
ATGGATTCCTGCCGACTATTAAATAAATCCCTAGCTCATATGCTAACATT
GCTAGGAGCAGATTAGGTCCTATTAGTTATAAAAGAGACCCATTTTCCCA
GCATCACCAGCTTATCTGAACAAAGTGATATTAAAGATAAAAGTAGTTTA
GTATTACAATTAAAGACCTTTTGGTAACTCAGACTCAGCATCAGCAAAAA
CCTTAGGTGTTAAACGTTAGGTGTAAAAATGCAATTCTGAGGTGTTAAAG
GGAGGAGGGGAGAAATAGTATTATACTTACAGAAATAGCTAACTACCCAT
TTTCCTCCCGCAATTCCTAGAAAATATTTCAGTGTCCGTTCACACACAAA
CTCAGCATCTGCAGAATGAAAAACACTCAAAGGATTAGAAGTTGAAAACA
AAATCAGGAAGTGCTGTCCTAAGAAGCTAAAGAGCCTCAGTTTTTTACAC
TCCCAAGATCAATCTGGATTTATGATTCTAAAACCCCTGGTGACAGAATC
AGAGGCTGAAAACACCACTAATTATAACCAGCAGGTATGGATATTTGGAA
GTCTAGGGGAGGCTGATATGAAGTTAAGACCAGAGGAAATATCTGTCCAC
TCCCTCTTCTCAACACCCATCTTCTAGACGCCAAGGCTAGCTATAGATCT
CCATTATAGTGTTCAAGGAATTAGGAATTATCCATGTCAATAGTTTTGAT
TAATGTGGACGGAGAACATCTATATTACTAGATGGCAATATGTGAAAGAA
GAAAACAGTATTGTTGAAAACCTAAATCTGAAATGTCAATGTAATGACAA
ATTTTCACCCCTAGAATGTCTACCTGGGGAGTCCTAACCCTCTAATATTC
CCCTGAGAGGGATGGGAGAATACAGTGCAGAGCTTTTATATAAGTATTTC
AGAAAGCAGTAGCTAAAGAATCACTTGTTTATTTCCCAGTGTTTCAAAGG
CCCTTCTGAAGAACTAAGCAAACTAAGGAAAGACCATTTAGTTTTAAACA
GGAGAAATGTATTTAACTAAATCCTAAACACAGCAGGCTATCTGCAAGCA
GCAGCAGCAGCAGCAGCCATGCTCCCTCACAGAATCCTTACAATTTTTGA
AGTTTTTTGTTTAACTGCTACAAAAGCCGATTTAGTAACATTTATTACAC
TTAAAAACTTCAGTTCATTTGTAGTTCAAAGCAAATGTATTGGCTTTGAG
TTTAAAGACTGAACTACTTTAGATTTGATTTGCATTTTTTTTTTTTTTTT
TTTTTGAGATGCAGTCTTGCTCTGTCAGCCAGGCTGGAGTGCAGTGGCTG
GATCTCAGCTCACGGCAAGCTCTGCCTCCTGGGTTCATGCCATTCTCCTG
CCTCAGCCTCCTGAGTAGCTGGGACTACAGATGCCCGCCACCATGCCCGG
CTAATTTTTTGTATTTTTACTAGAGATGGGGTTTCACCGTGTTAGCCAGG
ATGGTCTCGATCTCCTGACCTCGTGATCTGCCCGCCTTGGCCCCCCAAAG
CGCTGGGATTACAGGCCTGAGCCACCACGCTTGGCATCTTTTTACCTTTC
ATTAACTTTGATGCAAACCTATAGCTTAAGGTATCTTAAACTTTAATGAC
ATTTTTCTCTAAAATAGTAGTTTGTAATAACTTGTTCTGGCACCTGGCTC
CAATGAACACTACCCTCTGACCCTGTGGTATAATTTTCATGAGTAAGTGG
AAACCTAAGATCTTAGAAGTTCAACGGCAATGTGTCCAAGGGGTTTAGAT
CCTCTCCTTAAGTGCCTGTATCTCTGTGAAAAGAATCATCATAGGCTAGG
CGCGATGGCTCACACCTGTAATCCCAGCACTTTGGGAGGCCGAGGTAGGT
GGATCACCTGAGGTCGGGAGTCCAAGACCAGCCTGACTGACATGGAAAAA
CCCTGTCTCTACTAAAAATACAAAATTAGGTATGGTGGTGCATTCCTGTA
ATCCCAGCTACTCGGGAGGCTGAGGCAGGAGAATCGCTTGAACCCGGGAG
GGGGAGGTTGCAGCAAGCCAAGATCGTGCCATTGCACTCCAGCAGCCTGG
GCAACAAGAGTGAAAAACTACACCTCAAAAACAAAAACAAAAACAAAAGA
ATCATCATCAAGTGAACTGGAACACATCCAGAGAACTAATTTTGTTAGAA
AGATTTTAGAGTTGAGCCACACAATCTGCATCTTCTGCGTCCTCCATGCA
CTCGTCTGCTTTCTGGAGCCCCATGAGTGAGTCTTAATCCTGTTCCAGAT
AACAGTTCTCTTCCGGGTAACGGTTCTTCAGATACTTGAAGACAGTGTCT
TATTTCCTTAAATCTTCTCATTTCTTCTTCAAAAGACAGTATTTCAAGTT
ACTTTTATGTATCTTTACCATCTACCTCTGGATAAACACTCTCCAATTTG
TCAGTGACCATGTTAAAAACCAAGCACGGTGCTTAAAACTGACATCATCT
TTCAGGCAATCACTCCATTGGAGAATACAGTGGGGCTCTGGATCTGTACT
TCACTTGCTCCAGAGCCTCTGCTTGTGTTAATACGGCCCAGTTTCAAATA
AGCATTTTTAGCAGCCCTGAAATGTGTACTCAGATTTAGTTTATAGTCAA
CTAAAAACACCCAGAGGTCTCCTGTATTACACAAGTTATAATTAAAACCT
TAAAAGAGAAAGGTATAGGACAAATGATCTGTCTCCTCCCTTTTTTGCTT
TTTCATATGTTAAGACTATCTCGGAGCTGTTATCAGACTTTTTTCCTGAA
AAACTCTCAACAATACTCAAACTAGGTGTTACATGAAGCTGGGGTCTCCA
GGTTTTGCCTCACTTGTTCTTTCTTTTGTTGTTGTTGAGACAGAGTCTCA
CTCTGTCGCCAGGCTGGAGTGCAGTGGCAGGATCTCAGCTGACTGCAACC
TCAGCCTCCAGAGTTCAAGCAATTCTTCTGTGTCAGCCTCCCAAGTAGCT
GGGATTACAGGTGCACACCACCACGCCCAGCCA
12 Another example of annotation
13
tctctggttagtttgtaacatcaagtacttacctcattcagcatttttct
ttctttaatagactgggtcacccctaaagag

tccgggattagtctgta
tgaggtacccaccacactcagaagttttctttcttggatagacttgatca
cccctgaagagaag
14 Data summary
tctctggttagtttgtaacatcaagtacttacctcattcagcatttttct
ttctttaatagactgggtcacccctaaagag

tccgggattagtctgta
tgaggtacccaccacactcagaagttttctttcttggatagacttgatca
cccctgaagagaag
15 Statistics Question: Are the two sequences independent?
Algebra Question: Is the 4x4 matrix close to rank 1?
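As a rough illustration of these two questions, the following Python sketch builds a 4x4 table from the two sequences on the Data summary slide and compares it with the rank-1 table that has the same row and column sums. Pairing the sequences position by position is an assumption made for this example, since the slides do not say how the matrix was computed. The names `count_table`, `rank_one_fit` and `chi_square` are likewise hypothetical.

```python
from itertools import product

BASES = "ACGT"

# The two sequences from the "Data summary" slide (both have the same length).
SEQ1 = ("tctctggttagtttgtaacatcaagtacttacctcattcagcatttttct"
        "ttctttaatagactgggtcacccctaaagag").upper()
SEQ2 = ("tccgggattagtctgta"
        "tgaggtacccaccacactcagaagttttctttcttggatagacttgatca"
        "cccctgaagagaag").upper()


def count_table(x, y):
    """Entry [i][j] counts positions with BASES[i] in x and BASES[j] in y.

    Assumption: the sequences are compared position by position.
    """
    if len(x) != len(y):
        raise ValueError("sequences must have equal length")
    index = {base: i for i, base in enumerate(BASES)}
    table = [[0] * 4 for _ in range(4)]
    for a, b in zip(x, y):
        table[index[a]][index[b]] += 1
    return table


def rank_one_fit(table):
    """Rank-1 table with the same row and column sums as `table`."""
    n = sum(map(sum, table))
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    return [[rows[i] * cols[j] / n for j in range(4)] for i in range(4)]


def chi_square(table):
    """Pearson chi-square distance between `table` and its rank-1 fit."""
    fit = rank_one_fit(table)
    return sum(
        (table[i][j] - fit[i][j]) ** 2 / fit[i][j]
        for i, j in product(range(4), repeat=2)
        if fit[i][j] > 0
    )


table = count_table(SEQ1, SEQ2)
for base, row in zip(BASES, table):
    print(base, row)
print("chi-square distance to rank 1:", chi_square(table))
```

A small distance suggests the table is close to rank 1, which is what independence predicts. Turning the distance into a formal test would additionally need a reference distribution, which this sketch does not cover.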
16 The independence model
- $m = 16$ observable states $\{A,C,G,T\}^2$
- $d = 6$ unknown parameters
- $(a_A, a_C, a_G, a_T, b_A, b_C, b_G, b_T)$ where $a_A + a_C + a_G + a_T = b_A + b_C + b_G + b_T = 1$
- Independence means probabilities factor: $p_{AG} = \mathrm{prob}(A,G) = a_A \cdot b_G$
17 The independence model
- $m = 16$ observable states $\{A,C,G,T\}^2$
- $d = 6$ unknown parameters
- $(a_A, a_C, a_G, a_T, b_A, b_C, b_G, b_T)$ where $a_A + a_C + a_G + a_T = b_A + b_C + b_G + b_T = 1$
- Independence means probabilities factor: $p_{AG} = \mathrm{prob}(A,G) = a_A \cdot b_G$
- The model is the polynomial map $(a, b) \mapsto a^T b$
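The polynomial map of the independence model sends the pair (a, b) to the 4x4 matrix with entries a_i b_j. The sketch below evaluates that map with exact fractions and checks that all 2x2 minors of the image vanish, which is the algebraic way of saying the matrix has rank at most 1. The parameter values and the function names `independence_map` and `two_by_two_minors` are made up for illustration.

```python
from fractions import Fraction
from itertools import combinations


def independence_map(a, b):
    """Send (a, b) to the matrix a^T b with entries a[i] * b[j]."""
    return [[ai * bj for bj in b] for ai in a]


def two_by_two_minors(p):
    """All 2x2 minors p[i][j]*p[k][l] - p[i][l]*p[k][j] with i < k and j < l."""
    return [
        p[i][j] * p[k][l] - p[i][l] * p[k][j]
        for i, k in combinations(range(len(p)), 2)
        for j, l in combinations(range(len(p[0])), 2)
    ]


# Hypothetical parameter values; each vector sums to 1.
a = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 8), Fraction(1, 8)]
b = [Fraction(1, 10), Fraction(2, 10), Fraction(3, 10), Fraction(4, 10)]

p = independence_map(a, b)
print("total probability:", sum(map(sum, p)))
print("all minors vanish:", all(minor == 0 for minor in two_by_two_minors(p)))
```

Although eight symbols appear in (a, b), the two sum-to-one constraints leave 3 + 3 = 6 free parameters, matching d = 6 on the slide.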
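18 Models for discrete data
A statistical model is a parameterized family of probability distributions, given by a map from the parameter space to the probability simplex:
$\Theta \subseteq \mathbb{R}^d \longrightarrow \Delta \subseteq \mathbb{R}^m$
- $d$ = number of parameters
- $m$ = number of observable states
- $\Theta$ = the parameter space
- $\Delta$ = probability simplex on the $m$ states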
19 The geometry of maximum likelihood estimation
[Figure labels: parameter space, data, probability simplex]
20 Observed data
tctctggttagtttgtaacatcaagtacttacctcattcagcatttttct
ttctttaatagactgggtcacccctaaagagatc

tccgggattagtct
gtatgaggtacccaccacactcagaagttttctttcttggatagacttga
tcacccctgaagagaag
21 Hidden data
tctctggttagtttgtaacatcaagtacttacCTCATTCAGCATTTTTCT
TTCTTTAATAGACTGGGTCACCCctaaagagatc

tccgggattagtct
gt---atgaggtacccacCACACTCAGAAGTTTTCTTTCTTGGATAGACT
TGATCACCCctgaagagaag
22 The alignment problem is to find the shortest
path in the alignment graph.
This is solved with dynamic programming and is
known in computational biology as the
Needleman-Wunsch algorithm.
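A minimal sketch of this dynamic program, written as a shortest-path computation on the alignment graph, might look as follows. The edge lengths (0 for a match, 1 for a mismatch, 1 for a gap) are assumptions chosen for illustration and are not the scoring scheme of the slides. The function name `needleman_wunsch` is simply a label for this example.

```python
def needleman_wunsch(x, y, mismatch=1, gap=1):
    """Return (length of a shortest path, aligned x, aligned y).

    Node (i, j) of the alignment graph means "the first i letters of x and the
    first j letters of y have been aligned". Diagonal edges align two letters,
    vertical and horizontal edges insert a gap.
    """
    n, m = len(x), len(y)

    def diagonal(i, j):
        # Length of the diagonal edge entering node (i, j).
        return 0 if x[i - 1] == y[j - 1] else mismatch

    # dist[i][j] = length of a shortest path from (0, 0) to (i, j).
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i * gap
    for j in range(1, m + 1):
        dist[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dist[i][j] = min(
                dist[i - 1][j - 1] + diagonal(i, j),
                dist[i - 1][j] + gap,
                dist[i][j - 1] + gap,
            )

    # Trace one shortest path back from (n, m) to (0, 0).
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + diagonal(i, j):
            top.append(x[i - 1])
            bottom.append(y[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + gap:
            top.append(x[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(y[j - 1])
            j -= 1
    return dist[n][m], "".join(reversed(top)), "".join(reversed(bottom))


cost, top, bottom = needleman_wunsch("ACGT", "AGT")
print(cost)
print(top)
print(bottom)
```

The table has (n+1)(m+1) entries and each is filled in constant time, so the running time is proportional to the product of the two sequence lengths.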
23 The algebraic statistical model for sequence
alignment, known as the pair hidden Markov
model, is the image of a map
whose coordinates are polynomials with one term
for each path in the alignment graph.
The logarithms of the 33 parameters give the edge
lengths for the shortest path problem on the
alignment graph.
24 General Mathematical Framework
- Statistical models are algebraic varieties.
- Algebraic varieties can be tropicalized.
- Tropicalized models are useful for MAP inference in statistics.
L. Pachter and B. Sturmfels, Tropical Geometry of Statistical Models, Proceedings of the National Academy of Sciences, Volume 101, no. 46 (2004), pp. 16132-16137.
L. Pachter and B. Sturmfels, Parametric Inference for Biological Sequence Analysis, Proceedings of the National Academy of Sciences, Volume 101, no. 46 (2004), pp. 16138-16143.
25 2.1. Tropical arithmetic and dynamic programming
In tropical algebraic geometry, varieties are piecewise linear.
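One way to see the link between tropical arithmetic and dynamic programming is to run the same recursion over the alignment graph in different semirings. In ordinary arithmetic (+, x) it sums a product of edge weights over all paths, and in tropical arithmetic (min, +) it returns the length of a shortest path. The sketch below is illustrative only: the weights in `theta` are invented and do not form a properly normalized pair hidden Markov model, the names `Semiring`, `path_sum`, `theta` and `edge_length` are hypothetical, and negative logarithms are used as edge lengths so that the most probable path becomes the shortest one.

```python
import math
from collections import namedtuple

Semiring = namedtuple("Semiring", "add mul zero one")

ORDINARY = Semiring(lambda a, b: a + b, lambda a, b: a * b, 0.0, 1.0)
MAX_TIMES = Semiring(max, lambda a, b: a * b, 0.0, 1.0)
TROPICAL = Semiring(min, lambda a, b: a + b, math.inf, 0.0)


def path_sum(x, y, ring, weight):
    """Evaluate "sum over paths of the product of edge weights" in `ring`.

    weight(a, b) is the weight of the edge aligning a with b, where None
    stands for a gap.
    """
    n, m = len(x), len(y)
    val = [[ring.zero] * (m + 1) for _ in range(n + 1)]
    val[0][0] = ring.one
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                step = ring.mul(val[i - 1][j - 1], weight(x[i - 1], y[j - 1]))
                val[i][j] = ring.add(val[i][j], step)
            if i > 0:
                step = ring.mul(val[i - 1][j], weight(x[i - 1], None))
                val[i][j] = ring.add(val[i][j], step)
            if j > 0:
                step = ring.mul(val[i][j - 1], weight(None, y[j - 1]))
                val[i][j] = ring.add(val[i][j], step)
    return val[n][m]


def theta(a, b):
    """Hypothetical edge parameters: gap 0.2, match 0.6, mismatch 0.1."""
    if a is None or b is None:
        return 0.2
    return 0.6 if a == b else 0.1


def edge_length(a, b):
    """Tropical edge length: the negative logarithm of the parameter."""
    return -math.log(theta(a, b))


x, y = "ACGT", "AGT"
print("number of paths:", path_sum(x, y, ORDINARY, lambda a, b: 1.0))
print("sum over all paths:", path_sum(x, y, ORDINARY, theta))
print("shortest path length:", path_sum(x, y, TROPICAL, edge_length))
print("-log of best path weight:", -math.log(path_sum(x, y, MAX_TIMES, theta)))
```

The last two printed values agree up to rounding, because taking negative logarithms turns products into sums and maxima into minima.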
26 Comparative Genomics
human
tctctggttagtttgtaacatcaagtacttacCTCATTCAGCATTTTTCT
TTCTTTAATAGACTGGGTCACCCctaaagagatc
rat
tccgggattagtct
gt---atgaggtacccacCACACTCAGAAGTTTTCTTTCTTGGATAGACT
TGATCACCCctgaagagaag
27 Comparative Genomics
A phylogenetic tree on 5 taxa.
28 Comparative Genomics
The Petersen graph parametrizes trees on 5 taxa.
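The correspondence can be made concrete with a small sketch. Take the 10 two-element subsets of the taxa as vertices and join two subsets when they are disjoint, which gives the Petersen graph. Each of its 15 edges then picks out two disjoint pairs of taxa, and these are the two cherries of an unrooted binary tree on 5 taxa. The function names `petersen_graph` and `tree_from_edge`, and the bracket notation used for trees, are choices made for this example.

```python
from itertools import combinations

TAXA = (1, 2, 3, 4, 5)


def petersen_graph(taxa=TAXA):
    """Vertices are 2-element subsets of taxa; edges join disjoint subsets."""
    vertices = [frozenset(pair) for pair in combinations(taxa, 2)]
    edges = [(u, v) for u, v in combinations(vertices, 2) if not (u & v)]
    return vertices, edges


def tree_from_edge(u, v, taxa=TAXA):
    """Unrooted binary tree whose two cherries are the pairs u and v.

    The remaining taxon hangs off the middle of the tree.
    """
    (middle,) = set(taxa) - u - v
    left, right = sorted([sorted(u), sorted(v)])
    return "(({},{}),{},({},{}))".format(*left, middle, *right)


vertices, edges = petersen_graph()
print(len(vertices), "vertices,", len(edges), "edges")
for u, v in edges:
    print(tree_from_edge(u, v))
```

The vertices themselves can be read as the 10 trees with a single internal edge, each separating one pair of taxa from the other three.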
29 Trees are Ubiquitous in Biology
Fig. 1. Y Chromosome of D. pseudoobscura Is Not Homologous to the Ancestral Drosophila Y, Antonio Bernardo Carvalho and Andrew G. Clark, Science, January 7, 2005.
30 [Figure: four diagrams, each with labels 1 to 5.]
31 Summer school Themes
Algebra, discrete mathematics and statistics are relevant for genomics and vice versa...
Organ system (digestive)
Organ (liver)
Tissue (liver sinusoid)
Cell (hepatocyte)
Organelle (nucleus)
Molecule (DNA): TAGAGACGGGGGTTTCACAATGTTGGCCA