Title: Genome-wide Shotgun Mapping, Validating, Sequence Aligning
1Genome-wide Shotgun Mapping, Validating,
Sequence Aligning Population Studies.
NYU Faculty Research Descriptions, October 16, 2001
- Bud Mishra
- Professor of Computer Science and Mathematics (Courant Institute)
- Professor (Cold Spring Harbor Laboratory)
- http://www.cs.nyu.edu/mishra
2People
- Senior Research Scientists
- Marco Antoniotti (CS)
- Archisman Rudra (CS)
- Junior Research Scientists
- Toto Paxia (CS)
- Marc Rejali (Bio/CS)
- Valerio Luccio (CS)
- Joe McQuown (Stat/OR)
- Graduate Students
- Joey Zhou (Biology)
- Will Casey (Math)
- Vera Cherepinsky (Math)
- Collaborators
- Mt Sinai School of Medicine
- Harel Weinstein
- Bob Desnick
- Courant Institute
- Misha Gromov
- Biology Dept NYU
- Gloria Coruzzi
- Phil Benfey
- Collaborators
- Cold Spring Harbor Lab
- Mike Wigler
- Dick McCombie
- Vivek Mittal
- University of Wisconsin
- Tom Anantharaman
- David Schwartz
- Memorial Sloan Kettering
- Larry Norton
3Shotgun Sequence Assembly
- A jigsaw puzzle
- Assembles a collection of words while minimizing errors.
- An Example (somewhat unrealistic)
- Words: complete, correct, -ly, but, the, sequence, in-, human, published, is, genome.
- Solution 1: Minimize unused letters
- The published human genome sequence is incomplete but correct.
- Solution 2: Minimize the number of spaces
- The published human genome sequence is completely incorrect.
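The word puzzle above can be checked mechanically. The following is a minimal sketch, not part of the original slides: the grouping of fragments into tokens and the helper name `score` are assumptions made for illustration. It counts the letters left unused and the spaces needed by each of the two solutions.

```python
# Word fragments from the slide; "-ly" and "in-" attach to a neighbouring word.
WORDS = ["complete", "correct", "-ly", "but", "the", "sequence",
         "in-", "human", "published", "is", "genome"]

def score(assembly):
    """Return (sentence, unused_letters, spaces) for one assembly.

    `assembly` is a list of tokens; each token is a list of fragments
    from WORDS that are glued together without a space.
    """
    unused = list(WORDS)
    for token in assembly:
        for fragment in token:
            unused.remove(fragment)  # raises ValueError if a fragment is reused
    unused_letters = sum(len(w.strip("-")) for w in unused)
    sentence = " ".join("".join(f.strip("-") for f in token) for token in assembly)
    return sentence, unused_letters, len(assembly) - 1

# Hypothetical encodings of the two solutions shown on the slide.
solution_1 = [["the"], ["published"], ["human"], ["genome"], ["sequence"],
              ["is"], ["in-", "complete"], ["but"], ["correct"]]
solution_2 = [["the"], ["published"], ["human"], ["genome"], ["sequence"],
              ["is"], ["complete", "-ly"], ["in-", "correct"]]

for name, sol in (("Solution 1", solution_1), ("Solution 2", solution_2)):
    print(name, score(sol))
```

Solution 1 leaves fewer letters unused, while Solution 2 needs fewer spaces, so the two objectives pick opposite sentences from the same fragments.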
4Validation, Alignment and Assembly
5Shotgun Mapping
- Large fragments of genomic DNA, from 2 Mb to 12 Mb in length, are optically mapped
- The resulting ordered restriction maps are automatically contiged by Gentig
- The consensus map computed by Gentig is free of errors due to partial digestion, sizing error and false cuts
6Shotgun Mapping
- Schematics
- Surface Chemistry
- Robotics
- BioChemistry
- Imaging
- Image Analysis
- Statistical Algorithms
- Visualization
7Gentig Maps: Plasmodium falciparum
- A. Gap-free consensus BamHI and NheI maps for all 14 chromosomes.
- B. BamHI map
- C. NheI map
- D. NheI map of Chromosome 3 displayed by ConVEx
8Validation and Its Software Architecture
9Dynamic Programming Recurrence
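$T_{i,j} = \min_{u \le i,\ v \le j} \Big\{ T_{i-u,\,j-v} + \ln\big[ \sqrt{2\pi(\sigma_i^2 - \sigma_{i-u}^2)} \,/\, p_c \big] + \frac{[(l_j - l_{j-v}) - (a_i - a_{i-u})]^2}{2(\sigma_j^2 - \sigma_{j-v}^2)} + (u-1)\ln\frac{1}{1-p_c} + (v-1)\ln\frac{1}{p_f} \Big\}$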
10P. Falciparum c14 Alignment
11P. Falciparum c14 Alignment
12Map Assisted Sequence Assembly
- Using Multi-Enzyme Optical Maps to anchor
Sequence Contigs.
- Sequence Assembly (Speed and Accuracy)
- Sequence Validation
- Sequence Contig Phasing
- Characterizing the gaps and finishing
13Sequence Anchoring
- Probability that a random sequence Y ($|Y| = L$) gets anchored at the wrong position in a map of genome length G:
- $P_{FP} \approx 1 - e^{-p r G}$
- where
- $r \le 2\,(pL)^m\, e^{-m p L (1 - b/2)}$
- Number of enzymes: m
- Cutting rate: p
- Relative sizing error: b
For BACs, about 5 enzymes suffice to anchor 5 Kb sequence contigs.
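To see how the number of enzymes affects wrong anchoring, the estimate can be evaluated numerically. This is a rough sketch under a stated assumption: the garbled slide formula is read as P_FP ≈ 1 - exp(-p r G) with r ≤ 2 (pL)^m exp(-m p L (1 - b/2)). The function name and all parameter values below are hypothetical and chosen only for illustration.

```python
import math

def anchoring_false_positive(G, L, p, m, b):
    """Upper-bound estimate of the wrong-anchoring probability.

    Assumed reading of the slide:
        P_FP ~ 1 - exp(-p * r * G),
        r   <= 2 * (p*L)**m * exp(-m * p * L * (1 - b/2)).
    G and L share one length unit; p is in cuts per that unit.
    """
    r = 2.0 * (p * L) ** m * math.exp(-m * p * L * (1.0 - b / 2.0))
    return 1.0 - math.exp(-p * r * G)

# Hypothetical values: 150 kb BAC, 5 kb contig, one cut per 2 kb, 10% sizing error.
estimates = [anchoring_false_positive(G=150.0, L=5.0, p=0.5, m=m, b=0.1)
             for m in range(1, 7)]
for m, value in enumerate(estimates, start=1):
    print(m, value)
```

Under this reading each extra enzyme multiplies r by pL·exp(-pL(1 - b/2)), which is below 1, so the estimate falls as m grows.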
14POPULATION STUDIES
15Population Analysis
- Algorithms to rapidly discern kilobase-size differences in the genome-wide shotgun maps of a large number of individuals. The resolution can be further improved.
- Markers characterized by optical maps
- Breakpoints
- Amplifications and Deletions
- RFLPs
Detailed mathematical analysis shows the feasibility of this scheme even with low-coverage maps.
16A Simple Analysis
- $C_R$: Coverage of the reference
- $C_T$: Coverage of the individual
- G: Length of the genome
- $p_E$: Cutting frequency of the enzyme
- b: Sizing error
- L: Length of the genomic fragment
- X: A region of differences
- Pr[A single fragment covers X]
- $= (L-X)/(G-L)$
- Pr[At least one of n random fragments covers X]
- $= 1 - [1 - (L-X)/(G-L)]^n \approx 1 - e^{-c(1-X/L)}$, where $c = nL/G$ is the coverage
- Probability that the difference of length X is detected by genome-wide map
- $[1 - e^{-C_R(1-X/L)} - e^{-C_T(1-X/L)}]$
- $\times\ [1 - (b/2)^{(X p_E + 1)}]$
- (See Mathematica Demo GenomeCompare.nb)
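The exponential form of the coverage probability can be compared with the exact expression. This is a small sketch, assuming the usual definition of coverage c = nL/G (n fragments of length L over a genome of length G), which the slide does not state explicitly. The function names are illustrative.

```python
import math

def cover_exact(G, L, X, c):
    """1 - (1 - (L-X)/(G-L))**n, with n = c*G/L random fragments (assumed)."""
    n = c * G / L
    return 1.0 - (1.0 - (L - X) / (G - L)) ** n

def cover_approx(L, X, c):
    """Exponential approximation 1 - exp(-c * (1 - X/L))."""
    return 1.0 - math.exp(-c * (1.0 - X / L))

# Lengths in Mb; values chosen for illustration only.
G, L, X, c = 3300.0, 2.0, 0.025, 6.0
print(cover_exact(G, L, X, c), cover_approx(L, X, c))
```

For fragments that are short relative to the genome, the two values agree closely, which is why the later formulas use only the exponential form.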
17An Example
- Genome Length
- G = 3300 Mb
- Average Fragment Length
- L = 2 Mb
- Coverages
- $C_R = 12$
- $C_T = 6$
- l = 25 Kb, e = 3 Kb, b = e/l
- $p_E = 1/l$
- Probability that the difference of length X is detected by genome-wide map
- $[1 - e^{-C_R(1-X/L)} - e^{-C_T(1-X/L)}]$
- $\times\ [1 - (b/2)^{(X p_E + 1)}]$
- (See Mathematica Demo GenomeCompare.nb)
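The example parameters can be plugged into the detection formula directly. This is a sketch under an assumption about the garbled slide: the detection probability is read as [1 - exp(-C_R(1-X/L)) - exp(-C_T(1-X/L))] × [1 - (b/2)^(X p_E + 1)]. It is not a reproduction of the GenomeCompare.nb notebook, and the function name is hypothetical.

```python
import math

def detection_probability(X, L, CR, CT, b, pE):
    """Assumed reading of the slide formula (all lengths in kb).

    First factor: both the reference and the individual map cover X.
    Second factor: the X*pE + 1 fragments do not all match by chance.
    """
    f = 1.0 - X / L
    covered = 1.0 - math.exp(-CR * f) - math.exp(-CT * f)
    distinguishable = 1.0 - (b / 2.0) ** (X * pE + 1.0)
    return max(0.0, covered) * distinguishable

# Slide values: L = 2 Mb, C_R = 12, C_T = 6, l = 25 kb, e = 3 kb, b = e/l, p_E = 1/l.
l, e = 25.0, 3.0
for X in (5.0, 25.0, 100.0, 500.0):
    print(X, detection_probability(X, L=2000.0, CR=12, CT=6, b=e / l, pE=1.0 / l))
```

Under this reading, small differences are limited by sizing error, while differences approaching the fragment length L are limited by coverage.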
18Some Interesting Applications
- Haplotyping
- Phasing haplotypes unambiguously (both for SNPs and RFLPs)
- Rearrangement events
- Amplifications and Deletions
- Translocations
- Synteny groups
- Hemizygous Deletions
19Improving the Resolution
- Markers, characterized by genome-wide optical maps, can be indexed to genes.
- Functionality of these genes can be established from
- homology searches
- motif identification or
- simply literature searches
- Many genes will need to be characterized in the context of experimental systems and populations.
- Genotype/phenotype relations can be established through large population studies, or kinship analysis, where the analysis is a combination of
- high-resolution cytogenetics (optical maps) and
- RFLP analysis.
20GENTIG ALGORITHM
21Genomic Contig Problem (GCP)
- Given M intervals (genomic DNA segments), each of length L
- $D_1, D_2, \ldots, D_M$
- $D_j \Rightarrow 0 < c_{j1} < c_{j2} < \cdots < c_{jn} < L$
- $c_{ji}$: True or False (Optical) Cut Sites
- p: Digestion rate
- k (> 3): Goodness
- Goal: Place the M intervals on the real line by fixing the alignment (orientation and position) of each $D_j$
- $D_j \mapsto A_j = (D_j, \mathrm{sgn}, x_j)$
- $\mathrm{Int}(D_j) = I_j = [x_j, x_j + L]$
- Subject to additional constraints
22Constraints for GCP
- Composite Map
- $0 < m_1 < m_2 < \cdots < m_K$
- $\forall m_i:\ |\{j : m_i \in A_j\}| \,/\, |\{j : m_i \in I_j\}| > p$
- Every admissible placement induces a permutation of the $D_i$'s (determined by the positions of their left ends)
- $\pi \to$ Permutation: $A_{\pi(1)}, A_{\pi(2)}, \ldots, A_{\pi(M)}$
- Goodness
- $c(A_1, \ldots, A_M) = \min_j |\{m_i \in A_{\pi(j)} \cap A_{\pi(j+1)}\}| \ge k$
23GCP is NP-Complete
- Transformation from the Hamiltonian Path Problem restricted to cubic graphs.
Choose p = 3/4 and k = M.
24NP-Completeness
- G has a Hamiltonian path
- $v_1, v_2, \ldots, v_M$
- Then, the admissible placement is
- $D_1, D_2, \ldots, D_M$
- with at most two intervals $I_j$, $I_{j+1}$ overlapping with k cuts in common.
- Conversely, any admissible placement with a goodness > k induces a permutation $\pi$ on the indices of the vertices of G.
- $v_{\pi(1)}, v_{\pi(2)}, \ldots, v_{\pi(M)}$: Hamiltonian
[Figure: D1, D2, D3, Consensus Map]
25Overlap Rule
- Comparing Two Genomic Restriction Maps
Given two maps A and B, we say that they overlap if:
1. k or more of the restriction fragments align positionally (subject to sizing error)
2. The number of unmatched fragments in either prefix is bounded by r
26Comparing Maps: Effect of Partial Digestion
- Parameters
- Partial digestion probability, p
- Relative sizing error, b
- Restriction fragments, n
- Overlap threshold ratio, q
- $m = np$: Expected detected restriction fragments.
- Controlling False Negative
- $k \le n p^4 q/2$ and $r = k_1/p^4$, $k_1 \approx 2$
- If in fact the clones A and B overlap, then we will detect it with a probability of at least
- $(1 - e^{-k_1})\,(1 - e^{-n p^4 q/8})$
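The false-negative control can be made concrete with a few lines of arithmetic. This is a minimal sketch, assuming the slide's relations read k ≤ n p^4 q/2, r = k1/p^4 and a detection probability of at least (1 - exp(-k1))(1 - exp(-n p^4 q/8)). The function name and the sample values are hypothetical.

```python
import math

def overlap_detection_lower_bound(n, p, q, k1=2.0):
    """Return (largest admissible k, prefix bound r, detection lower bound).

    Assumed reading of the slide:
        k <= n * p**4 * q / 2,   r = k1 / p**4,
        Pr[detect] >= (1 - exp(-k1)) * (1 - exp(-n * p**4 * q / 8)).
    """
    k_max = n * p ** 4 * q / 2.0
    r = k1 / p ** 4
    prob = (1.0 - math.exp(-k1)) * (1.0 - math.exp(-n * p ** 4 * q / 8.0))
    return k_max, r, prob

# Hypothetical values: 200 fragments per molecule, 80% digestion, 20% overlap.
k_max, r, prob = overlap_detection_lower_bound(n=200, p=0.8, q=0.2)
print(k_max, r, prob)
```

The p^4 factor reflects that a fragment matches only if both of its bounding cuts are present in both maps, so the bound degrades quickly as the digestion rate drops.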
27Overlap Rule
- Controlling False Positive
- Consider an arbitrary alignment. Let the random variable W denote the number of fragments in clone A that positionally match with the fragments of clone B.
- $E[C(W, i)] = C(m, i)\,(b/2)^i \approx (1/i!)\,(n p b/2)^i$
- By Brun's sieve
- $\Pr[W = i] \approx (1/i!)\,(b n p/2)^i\, e^{-b n p/2}$
- i.e., Poisson with mean $b n p/2$
- and the false positive probability is
- $4 r \sum_{i \ge k} (1/i!)\,(b n p/2)^i\, e^{-b n p/2}$
- Make r as small and k as large as possible
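The false-positive bound is a scaled Poisson tail, which is easy to evaluate. This is a sketch under a stated assumption: the bound is read as 4 r times the probability that a Poisson variable with mean b n p/2 is at least k. The helper names and parameter values are hypothetical, chosen only to show the trend.

```python
import math

def poisson_tail(mean, k, terms=200):
    """Pr[W >= k] for W ~ Poisson(mean), summed term by term (requires mean > 0)."""
    term = math.exp(-mean + k * math.log(mean) - math.lgamma(k + 1))
    total = 0.0
    for i in range(k, k + terms):
        total += term
        term *= mean / (i + 1)
    return total

def false_positive_bound(n, p, b, k, r):
    """Assumed reading: 4 * r * Pr[Poisson(b*n*p/2) >= k], capped at 1."""
    return min(1.0, 4.0 * r * poisson_tail(b * n * p / 2.0, k))

# Hypothetical parameters, chosen only for illustration.
for k in (4, 8, 12, 16):
    print(k, false_positive_bound(n=200, p=0.8, b=0.04, k=k, r=5))
```

The bound shrinks rapidly as k grows and scales linearly with r, which matches the advice to keep r small and k large.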
28Experimental Design
- Relation among the error parameters
- $3 b n p/4 \le k \le n p^4 q/2$
- $\Rightarrow\ p \ge (3b/(2q))^{1/3}$
- Parameter choice for shotgun-mapping. Make the partial digestion probability rather high (close to 1) or the relative sizing error low, for instance by using a rare cutter.
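The relation among the error parameters gives a threshold on the digestion rate. This is a minimal sketch, assuming the slide's constraint reads 3bnp/4 ≤ k ≤ n p^4 q/2: dividing through by np gives p^3 ≥ 3b/(2q). The function names and sample values are illustrative only.

```python
def min_digestion_rate(b, q):
    """Smallest p for which 3*b*n*p/4 <= n*p**4*q/2 can hold."""
    return (3.0 * b / (2.0 * q)) ** (1.0 / 3.0)

def k_window(n, p, b, q):
    """Lower and upper limits on k; the window is empty when lower > upper."""
    return 3.0 * b * n * p / 4.0, n * p ** 4 * q / 2.0

# Hypothetical values: 10% relative sizing error, 50% overlap ratio.
p_min = min_digestion_rate(b=0.1, q=0.5)
print(p_min)
print(k_window(n=200, p=p_min, b=0.1, q=0.5))
print(k_window(n=200, p=0.9, b=0.1, q=0.5))
```

At the threshold the two limits on k coincide. Above it a window of admissible k opens, and lowering b moves the threshold down.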
29Contour Plot as a Function of Sizing Error (x-axis) and Digestion Rate (y-axis)
- The calculation is for the human genome, G = 3,300 Mb.
- The average molecule length is 5 Mb, with an overlap of 1 Mb
- The average restriction fragment length is 25 kb
- For a sizing error of 3 kb, the required digestion rate is 80%
- If the sizing error is reduced to 2 kb, the required digestion rate drops to 70%
- (See Mathematica Demo GentigFeasibility.nb)
30Gentig (GENomic conTIG) Algorithm
- An upper bound estimate of the false positive overlap probability
- A Bayesian probability estimate for the proposed placement
Maximize the Bayesian Probability Density subject to the False Positive Probability Constraint
GREEDY ALGORITHM
31Other Ongoing Projects
- Valis: Bioinformatic Environment and Language
- (Funded by DOE and NYSTAR)
- Microarray-based Genomic Mapping
- (In collaboration with CSHL, funded by NCI/NIH)
- Expression Data Analysis
- (In collaboration with NYU Biology, funded by NSF)
- Cell Informatics
- (Funded by DARPA)
32Valis Architecture
- Valis aims to address all aspects of post-genomic biology.
- With this goal in mind we built a powerful computational infrastructure
- With a distributed architecture consisting of a Linux cluster and customized special hardware for homology search
- A large database system for massive amounts of biological data in multiple forms
- Mathematical, statistical and algorithmic tools that can handle the multitude of scientific problems arising from bioinformatics, comparative and functional genomics, cell informatics, population genomics, etc.
33Algorithmic Support
- Wide classes of mathematical and computational tools are integrated into Valis
- Most of the work and the interesting technical developments are algorithmic. The tools rely on mathematical ideas from
- combinatorics,
- probabilistic methods,
- statistics,
- kinetic modeling,
- and dynamical and discrete event systems.
- For example, to construct probe maps using
microarrays, the algorithm relies on a
probabilistic analysis of when nearby probes get
hybridized by a low coverage sample from a clone
library. This analysis is built into the design
of the microarray experiments, and also exploited
by the data structures used in the algorithm.
34Bio-computing
- Joint Project involving Cold Spring Harbor Lab and Courant Institute
- Algorithmic Tools and Computational Frameworks for Cell Informatics, 2001-2004
- Two areas of interest
- Computational Tools
- Valis Informatics Tools
- Simulation Tools
- Reasoning Tools
- Biological Experiments
- DNA Evolution
- Cell Communication
35Cocultivation Experiments
- Cells signal through communication proteins
- Many communication proteins fall into two classes
- Extracellular factors and
- External receptors.
- Factor-receptor interactions occur in pairs and influence the genes and proteins that cells express.
- Factors and receptors are encoded by genes, about a thousand of each class.
- Only a few of each class are expressed in cells of a particular type.
- The pairings of factors and receptors are largely unknown.
- The consequences of most factor-receptor interactions are unknown.
- These pairings and their consequences are explored by cell cocultivation experiments.
- We examine cell types A and B alone, and when cocultured (A c B).
- We examine the genes expressed by cells using DNA microarrays, which quantitate tens of thousands of genes simultaneously.
36Experimental Results
- Cell A is a carcinoma (derived from ectoderm),
- Cell B is a sarcoma (derived from mesoderm),
- The data displayed are ratios of expressed genes,
each point the ratio of either A or B alone, or A
and B cocultured (A c B) vs the combination of
both A and B cultured alone and then combined
(AB).
37Simulation and Inference
- A Rudimentary Simulation with Mathematica
- Specifications of cell types-
38Concluding Remarks
- The central claim of this proposal is that, by
drawing upon mathematical approaches developed in
the context of dynamical systems, kinetic
analysis, computational theory and logic, it is
possible to create powerful simulation, analysis
and reasoning tools for working biologists to be
used in deciphering existing data, devising new
experiments and ultimately, understanding
functional properties of genomes, proteomes,
cells, organs and organisms.
- Risks and Challenges
- Effectively integrating diverse methodologies to study a monolithic, heterogeneous and complex system
- Multiple hierarchical levels of fidelity
- Multiple spatio-temporal scales
- Automatically designing experiments that can falsify and eventually revise existing models.
- Milestones and Deliverables
- Kinetic-modeling prototyping tool with VALIS interface
- A quantitative sequence analysis tool
- A genome-generator (based on a stochastic context-free grammar and simulation of genome-rearrangements)
- A preliminary integrated tool combining simulation, visualization, numerical integration and symbolic algebraic analysis
- A hybrid system simulator and modal logic reasoning system
- A simulation and combinatorial/probabilistic analysis system for DNA repair, polymerase stuttering and unselected DNA drift (using kinetic models)
- A multi-fidelity model of signal transduction
- Experimental validation of microarray-based, computationally driven model of the RAS pathway