Title: Bioinformatics: Buzzword or Discipline (???)
1Bioinformatics Buzzword or Discipline (???)
2Outline of the course
- Analysis of one DNA sequence: shotgun sequencing, Markov-chain modeling, patterns and repeats.
- Analysis of multiple DNA or protein sequences: dynamic programming alignments, substitution matrices.
- BLAST: algorithm for sequence retrieval and comparison.
- Refresher on Markov chains: capsule theory, Markov-chain Monte Carlo algorithms.
- Hidden Markov models: Viterbi algorithm and its applications.
- Evolutionary models: models of nucleotide mutation and substitution, recombination and genetic drift, with applications to genome evolution and gene mapping.
- Molecular phylogenetics (tree making): distance matrix, maximum likelihood and parsimony.
- Special topics: gene and protein networks, analysis of DNA-microarray data,
330,000
Genes make up only 3% of the genome
BCM-HGSC
4Genome Sizes
Organism        Genome size (base pairs)
Human           3.0 x 10^9
Mouse           3.0 x 10^9
Drosophila      1.1 x 10^8
Worm            1.0 x 10^8
Dictyostelium   3.4 x 10^7
Yeast           1.2 x 10^7
Bacteria        1.0 - 5.0 x 10^6
5Shotgun Sequencing
High-accuracy sequence: < 1 error per 10,000 bases
6The Human Genome: 3 Billion Base Pairs
Whole Genome Shotgun Strategy
Genome: 3 billion bases
Libraries of clones: 3 kb, 10 kb and 50 kb
DNA sequence reads: 500 bases each
AGGCTCACTG
7Statistical issues in shotgun strategy
- Model for the random fragments: binomial/Poisson process
- Coverage of sequence by random fragments
- Mean number of contigs
- Mean size of contigs
- Coverage by anchored contigs
8Binomial/Poisson Process
- N fragments, of length L each, randomly scattered in the interval of length G.
- Coverage: a = NL/G
- Contig: union of overlapping fragments. We want contigs to cover as much of G as possible.
- Pr[# frags with left end in (x - h, x) = k] is binomial(N, h/G), or approximately Poisson(Nh/G) (when?).
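To get a feel for the "(when?)" question, the two distributions can be compared numerically. The sketch below is only an illustration: the values of N, G and h are assumptions chosen so that N is large and h/G is small, and the helper names `binomial_pmf` and `poisson_pmf` are hypothetical.

```python
import math

def binomial_pmf(k, n, p):
    """Pr[X = k] for X ~ Binomial(n, p)."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, lam):
    """Pr[X = k] for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam**k / math.factorial(k)

# Illustrative values (assumptions, not taken from the slides):
# N fragments, interval length G, window length h.
N, G, h = 250, 100_000, 800
p = h / G            # chance that one fragment's left end falls in the window
lam = N * h / G      # Poisson mean, here 2.0

# Print k, the binomial probability and the Poisson probability side by side.
for k in range(6):
    print(k, round(binomial_pmf(k, N, p), 4), round(poisson_pmf(k, lam), 4))
```

With many fragments and a window that is short relative to G, the two columns should agree closely; shrinking N while holding Nh/G fixed makes the gap visible.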
9Mean number of contigs
- E[# contigs] = N x Pr[a frag is rightmost in a contig]
  = N x Pr[frag does not include the left end of any other frag]
  = N x exp(-NL/G) = (aG/L) x exp(-a)
L = 800, G = 100,000
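The formula E[# contigs] = N exp(-NL/G) can be tabulated directly. The following is a minimal sketch using the slide's values L = 800 and G = 100,000; the function name `expected_contigs` and the particular fragment counts are illustrative assumptions.

```python
import math

def expected_contigs(N, L, G):
    """Expected number of contigs, N * exp(-a), with coverage a = N*L/G."""
    a = N * L / G
    return N * math.exp(-a)

L, G = 800, 100_000   # values from the slide

# Fragment counts chosen for illustration; N = G/L = 125 gives coverage a = 1.
for N in (25, 100, 125, 150, 250, 500):
    a = N * L / G
    print(f"N = {N:4d}  a = {a:.1f}  E[# contigs] = {expected_contigs(N, L, G):.2f}")
```

Because N exp(-NL/G) has derivative proportional to (1 - a), the expected count peaks at coverage a = 1 and falls off as fragments start to merge.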
10Mean contig size
- E[S] = E[# frags - 1] x E[inter-epoch distance] + L
11Mean contig size
[Figure: mean contig size E(S) plotted against coverage a]
12Number of anchored contigs
# anchors = M, # frags = N, a = NL/G, b = ML/G
E[# anchored contigs] = N b [exp(-a) - exp(-b)] / (b - a)
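The anchored-contig formula, E[# anchored contigs] = N b [exp(-a) - exp(-b)] / (b - a) with a = NL/G and b = ML/G, can be evaluated for a range of anchor counts. This is a sketch only: the function name `expected_anchored_contigs` and the sample values are assumptions, and the case b = a is handled by taking the limit N a exp(-a).

```python
import math

def expected_anchored_contigs(N, M, L, G):
    """N * b * (exp(-a) - exp(-b)) / (b - a), with a = N*L/G and b = M*L/G."""
    a = N * L / G
    b = M * L / G
    if math.isclose(a, b):
        # Limit of the formula as b -> a.
        return N * a * math.exp(-a)
    return N * b * (math.exp(-a) - math.exp(-b)) / (b - a)

# Illustrative values: N = 125 fragments of length 800 on G = 100,000 (a = 1),
# with an increasing number of anchors M.
N, L, G = 125, 800, 100_000
for M in (0, 50, 125, 250, 375, 1000, 5000):
    value = expected_anchored_contigs(N, M, L, G)
    print(f"M = {M:5d}  b = {M * L / G:5.1f}  E[# anchored contigs] = {value:.2f}")
```

With no anchors the expectation is zero, and with very dense anchors every contig is anchored, so the value approaches N exp(-a), the expected number of contigs.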
13Conclusions
- Expected number of contigs first increases, then decreases with coverage.
- Expected size of contig increases with coverage.
- Expected number of anchored contigs first increases, then decreases with anchor density.
- Attention: these computations ignore boundary effects.