Genome-wide Shotgun Mapping, Validating, Sequence Aligning

Provided by: csN4
Learn more at: https://cs.nyu.edu
Transcript and Presenter's Notes

1
Genome-wide Shotgun Mapping, Validating,
Sequence Aligning & Population Studies
NYU Faculty Research Descriptions, October 16, 2001
  • Bud Mishra
  • Professor of Computer Science & Mathematics
    (Courant Institute)
  • Professor (Cold Spring Harbor Laboratory)
  • http://www.cs.nyu.edu/mishra


2
People
  • Senior Research Scientists
  • Marco Antoniotti (CS)
  • Archisman Rudra (CS)
  • Junior Research Scientists
  • Toto Paxia (CS)
  • Marc Rejali (Bio/CS)
  • Valerio Luccio (CS)
  • Joe McQuown (Stat/OR)
  • Graduate Students
  • Joey Zhou (Biology)
  • Will Casey (Math)
  • Vera Cherepinsky (Math)
  • Collaborators
  • Mt Sinai School of Medicine
  • Harel Weinstein
  • Bob Desnick
  • Courant Institute
  • Misha Gromov
  • Biology Dept NYU
  • Gloria Coruzzi
  • Phil Benfey
  • Collaborators
  • Cold Spring Harbor Lab
  • Mike Wigler
  • Dick McCombie
  • Vivek Mittal
  • University of Wisconsin
  • Tom Anantharaman
  • David Schwartz
  • Memorial Sloan Kettering
  • Larry Norton

3
Shotgun Sequence Assembly
  • A jigsaw puzzle
  • Assembles a collection of words while minimizing
    errors.
  • An Example (somewhat unrealistic)
  • Words: complete, correct, -ly, but, the,
    sequence, in-, human, published, is, genome.
  • Solution 1: Minimize unused letters
  • The published human genome sequence is incomplete
    but correct.
  • Solution 2: Minimize the number of spaces
  • The published human genome sequence is completely
    incorrect.

4
Validation, Alignment & Assembly
5
Shotgun Mapping
  • Large fragments of genomic DNA of length from 2Mb
    to 12Mb are optically mapped
  • The resulting ordered restriction maps are
    automatically contiged by Gentig
  • The consensus map computed by Gentig is free of
    errors due to partial digestion, sizing error and
    false cuts

6
Shotgun Mapping
  • Schematics
  • Surface Chemistry
  • Robotics
  • BioChemistry
  • Imaging
  • Image Analysis
  • Statistical Algorithms
  • Visualization

7
Gentig Maps: Plasmodium falciparum
  • A. Gap-free consensus BamHI & NheI maps for all
    14 chromosomes.
  • B. BamHI map
  • C. NheI map
  • D. NheI map of Chromosome 3 displayed by ConVEx

8
Validation & Its Software Architecture
9
Dynamic Programming Recurrence
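$$
T_{i,j} = \min_{u \le i,\ v \le j} \Big\{\, T_{i-u,\,j-v}
  + \ln \frac{\big[2\pi(\sigma_i^2 + \cdots + \sigma_{i-u}^2)\big]^{1/2}}{p_c}
  + \frac{\big[(l_j + \cdots + l_{j-v}) - (a_i + \cdots + a_{i-u})\big]^2}{2(\sigma_j^2 + \cdots + \sigma_{j-v}^2)}
  + (u-1)\ln\frac{1}{1-p_c}
  + (v-1)\ln\frac{1}{p_f} \,\Big\}
$$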
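
One way to read the recurrence above is as a dynamic program over two lists of fragment lengths. The Python sketch below is a simplified illustration under stated assumptions, not the validation software itself: `ref` (the a values) and `opt` (the l values) are lists of fragment lengths, `sigma2` holds a per-fragment variance for `ref`, `p_c` is the cut (digestion) probability, `p_f` the false-cut probability, and `max_merge` is a hypothetical practical bound on u and v that does not appear on the slide. The function name and default values are likewise made up for illustration.

```python
import math

def align_maps(ref, opt, sigma2, p_c=0.8, p_f=0.1, max_merge=3):
    """Return the minimum alignment cost T[n][m] between two fragment lists.

    Each step matches u consecutive reference fragments against v consecutive
    optical fragments: u-1 missed cuts and v-1 false cuts are penalized, and
    the summed lengths are compared under a Gaussian sizing-error model.
    """
    n, m = len(ref), len(opt)
    T = [[math.inf] * (m + 1) for _ in range(n + 1)]
    T[0][0] = 0.0  # both maps empty: nothing to explain
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            for u in range(1, min(i, max_merge) + 1):
                ref_len = sum(ref[i - u:i])
                var = sum(sigma2[i - u:i])
                for v in range(1, min(j, max_merge) + 1):
                    prev = T[i - u][j - v]
                    if math.isinf(prev):
                        continue  # predecessor state is unreachable
                    opt_len = sum(opt[j - v:j])
                    cost = (prev
                            + math.log(math.sqrt(2 * math.pi * var) / p_c)
                            + (opt_len - ref_len) ** 2 / (2 * var)
                            + (u - 1) * math.log(1 / (1 - p_c))
                            + (v - 1) * math.log(1 / p_f))
                    if cost < T[i][j]:
                        T[i][j] = cost
    return T[n][m]

exact = align_maps([10, 20, 30], [10, 20, 30], [1, 1, 1])
perturbed = align_maps([10, 20, 30], [10, 25, 30], [1, 1, 1])
missed_cut = align_maps([10, 20, 30], [30, 30], [1, 1, 1])
```

A lower cost means a more plausible alignment, so `exact` should come out below `perturbed`, and `missed_cut` stays finite because merging two reference fragments is allowed at the price of one missed-cut penalty.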
10
P. Falciparum c14 Alignment
11
P. Falciparum c14 Alignment
12
Map Assisted Sequence Assembly
  • Using Multi-Enzyme Optical Maps to anchor
    Sequence Contigs.
  • Sequence Assembly (Speed Accuracy)
  • Sequence Validation
  • Sequence Contig Phasing
  • Characterizing the gaps and finishing

13
Sequence Anchoring
  • Probability that a random sequence Y ($|Y| = L$)
    gets anchored at the wrong position in a map of
    genome length G
  • $P_{FP} \approx 1 - e^{-p r G}$
  • where
  • $r \le 2\,(pL)^m\, e^{-m p L (1 - b/2)}$
  • Number of enzymes, m
  • Cutting rate, p
  • Relative sizing error, b

For BACs about 5 enzymes suffice to anchor 5Kb
sequence contigs
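
The anchoring bound can be explored numerically. The sketch below assumes the reading P_FP ≈ 1 − exp(−p r G) with r ≤ 2 (pL)^m exp(−m p L (1 − b/2)), and plugs in illustrative values that are assumptions rather than figures from the slide: a cutting rate of one site per 25 Kb, a 5 Kb contig, a 150 Kb BAC, and b = 0.12. The function names are hypothetical.

```python
import math

def false_anchor_rate(p, seq_len, m, b):
    """Upper bound on r for m enzymes, cutting rate p, relative sizing error b."""
    pl = p * seq_len
    return 2.0 * pl ** m * math.exp(-m * pl * (1.0 - b / 2.0))

def false_positive_probability(p, seq_len, m, b, genome_len):
    """P_FP ~ 1 - exp(-p * r * G), using the upper bound for r."""
    r = false_anchor_rate(p, seq_len, m, b)
    return 1.0 - math.exp(-p * r * genome_len)

# Assumed, illustrative parameters (lengths in Kb).
p_cut, contig_kb, bac_kb, rel_err = 1.0 / 25.0, 5.0, 150.0, 0.12
by_enzymes = [false_positive_probability(p_cut, contig_kb, m, rel_err, bac_kb)
              for m in range(1, 6)]
for m, prob in enumerate(by_enzymes, start=1):
    print(f"m = {m}: P_FP <= {prob:.4f}")
```

Under these assumed values the bound shrinks quickly as enzymes are added, which is the qualitative behaviour behind the multi-enzyme claim.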
14
POPULATION STUDIES
15
Population Analysis
  • Algorithms to rapidly discern kilobase-size
    differences in the genome-wide shotgun maps of a
    large number of individuals. The resolution can
    be further improved.
  • Markers characterized by optical maps
  • Breakpoints
  • Amplification and Deletions
  • RFLPs

Detailed mathematical analysis shows the
feasibility of this scheme even with
low-coverage maps.
16
A Simple Analysis
  • $C_R$ = Coverage of the reference
  • $C_T$ = Coverage of the individual
  • G = Length of the genome
  • $p_E$ = Cutting frequency of the enzyme
  • b = Sizing error
  • L = Length of the genomic fragment
  • X = A region of differences
  • Pr[A single fragment covers X]
  • $(L-X)/(G-L)$
  • Pr[At least one of n random fragments covers X]
  • $1 - [1 - (L-X)/(G-L)]^n \approx 1 - e^{-c(1-X/L)}$, with coverage $c = nL/G$
  • Probability that the difference of length X is
    detected by genome-wide map
  • $[1 - e^{-C_R(1-X/L)} - e^{-C_T(1-X/L)}]$
  • $\times\ [1 - (b/2)^{X p_E + 1}]$
  • (See Mathematica Demo GenomeCompare.nb)
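
The exponential form of the coverage probability follows from a standard limit. A short derivation, assuming the n fragments are placed independently and uniformly, that G is much larger than L, and writing c = nL/G for the coverage (a symbol the slide uses without defining):

```latex
\begin{align*}
\Pr[\text{no fragment covers } X]
  &= \left(1 - \frac{L-X}{G-L}\right)^{n}
   \approx \exp\!\left(-\,n\,\frac{L-X}{G}\right) \qquad (G \gg L) \\
  &= \exp\!\left(-\,\frac{nL}{G}\left(1 - \frac{X}{L}\right)\right)
   = e^{-c\,(1 - X/L)}, \\
\Pr[\text{at least one fragment covers } X]
  &\approx 1 - e^{-c\,(1 - X/L)}.
\end{align*}
```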

17
An Example
  • Genome Length
  • G = 3300 Mb
  • Average Fragment Length
  • L = 2 Mb
  • Coverages
  • $C_R$ = 12
  • $C_T$ = 6
  • l = 25 Kb, e = 3 Kb, b = e/l
  • $p_E$ = 1/l
  • Probability that the difference of length X is
    detected by genome-wide map
  • $[1 - e^{-C_R(1-X/L)} - e^{-C_T(1-X/L)}]$
  • $\times\ [1 - (b/2)^{X p_E + 1}]$
  • (See Mathematica Demo GenomeCompare.nb)
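
The example values can be plugged into the detection formula directly. The transcript's version of that formula is garbled, so the sketch below rests on one reading of it, stated here as an assumption: a coverage factor 1 − exp(−C_R(1−X/L)) − exp(−C_T(1−X/L)) multiplied by a sizing factor 1 − (b/2)^(X·p_E + 1). Function and parameter names are hypothetical, and this is not the referenced Mathematica notebook.

```python
import math

def detection_probability(x_kb, c_ref=12.0, c_ind=6.0, frag_len_kb=2000.0,
                          avg_restriction_kb=25.0, sizing_err_kb=3.0):
    """Probability that a difference of length x_kb is detected (assumed reading)."""
    b = sizing_err_kb / avg_restriction_kb      # relative sizing error, b = e/l
    p_e = 1.0 / avg_restriction_kb              # cutting frequency, p_E = 1/l
    shrink = 1.0 - x_kb / frag_len_kb           # the factor (1 - X/L)
    miss_ref = math.exp(-c_ref * shrink)        # no reference fragment covers X
    miss_ind = math.exp(-c_ind * shrink)        # no individual fragment covers X
    covered = 1.0 - miss_ref - miss_ind
    distinguishable = 1.0 - (b / 2.0) ** (x_kb * p_e + 1.0)
    return covered * distinguishable

for x in (25, 100, 500, 1000):
    print(f"X = {x:4d} Kb: {detection_probability(x):.4f}")
```

With these inputs the coverage of the individual (C_T = 6) dominates the miss probability, and detection degrades as X approaches the fragment length L.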

18
Some Interesting Applications
  • Haplotyping
  • Phasing haplotypes unambiguously (both for SNPs
    and RFLPs)
  • Rearrangement events
  • Amplifications and Deletions
  • Translocations
  • Synteny groups
  • Hemizygous Deletions

19
Improving the Resolution
  • Markers, characterized by genome-wide optical
    maps, can be indexed to genes.
  • Functionality of these genes can be established
    from
  • homology searches
  • motif identification or
  • simply literature searches
  • Many genes will need to be characterized in the
    context of experimental systems, and populations.
  • Genotype/phenotype relations can be established
    through large population studies, or kinship
    analysis, where the analysis is a combination of
  • high-resolution cytogenetics (optical maps) and
  • RFLP analysis.

20
GENTIG ALGORITHM
21
Genomic Contig Problem (GCP)
  • Given M intervals (genomic DNA segments), each of
    length L
  • $D_1, D_2, \ldots, D_M$
  • $D_j \Rightarrow 0 < c_{j1} < c_{j2} < \cdots < c_{jn} < L$
  • $c_{ji}$ = True or False (Optical) Cut Sites
  • p = Digestion rate
  • k (> 3) = Goodness
  • Goal: Place the M intervals on the real line by
    fixing the alignment (orientation & position) of
    each $D_j$
  • $D_j \mapsto A_j = (D_j, \mathrm{sgn}, x_j)$
  • $\mathrm{Int}(D_j) = I_j = [x_j, x_j + L]$
  • Subject to additional constraints
  • Subject to additional constraints

22
Constraints for GCP
  • Composite Map
  • $0 < m_1 < m_2 < \cdots < m_K$
  • $\forall m_i:\ |\{j : m_i \in A_j\}| \,/\, |\{j : m_i \in I_j\}| > p$
  • Every admissible placement induces a permutation
    of the $D_i$'s (determined by the positions of their
    left ends)
  • $\pi \to$ Permutation: $A_{\pi(1)}, A_{\pi(2)}, \ldots, A_{\pi(M)}$
  • Goodness
  • $c(A_1, \ldots, A_M) = \min_j |\{m_i \in A_{\pi(j)} \cap A_{\pi(j+1)}\}| \ge k$

23
GCP is NP-Complete
  • Transformation from Hamiltonian Path Problem
    restricted to cubic graphs.

Choose p = 3/4, k = M
24
NP-Completeness
  • G has a Hamiltonian path
  • $v_1, v_2, \ldots, v_M$
  • Then, the admissible placement is
  • $D_1, D_2, \ldots, D_M$
  • with at most two intervals $I_j$, $I_{j+1}$
    overlapping with k cuts in common.
  • Conversely, any admissible placement with a
    goodness > k induces a permutation $\pi$ on the
    indices of the vertices of G.
  • $v_{\pi(1)}, v_{\pi(2)}, \ldots, v_{\pi(M)}$: Hamiltonian

[Figure: intervals D1, D2, D3 placed above the Consensus Map]
25
Overlap Rule
  • Comparing Two Genomic Restriction Maps

Given two maps A and B, we say that they overlap if:
  • 1. k or more of the restriction fragments align
    positionally (subject to sizing error)
  • 2. The number of unmatched fragments in either
    prefix is bounded by r
26
Comparing Maps: Effect of Partial Digestion
  • Parameters
  • Partial digestion probability, p
  • Relative sizing error, b
  • Restriction fragments, n
  • Overlap threshold ratio, q
  • m = n p: Expected number of detected restriction
    fragments.
  • Controlling False Negatives
  • $k \le n p^4 q/2$ and $r = k_1/p^4$, $k_1 \approx 2$
  • If in fact the clones A and B overlap, then we
    will detect it with probability at least
  • $(1 - e^{-k_1})\,(1 - e^{-n p^4 q/8})$
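
The false-negative bound is simple to evaluate. The sketch below assumes the reading k ≤ n p⁴ q/2, r = k₁/p⁴ and a detection probability of at least (1 − exp(−k₁))(1 − exp(−n p⁴ q/8)); the sample values n = 200, p = 0.8, q = 0.2 are illustrative assumptions, and the function names are hypothetical.

```python
import math

def overlap_thresholds(n, p, q, k1=2.0):
    """Return (largest admissible k, r) for the overlap rule."""
    k_max = n * p ** 4 * q / 2.0
    r = k1 / p ** 4
    return k_max, r

def detection_lower_bound(n, p, q, k1=2.0):
    """Lower bound on the probability of detecting a true overlap."""
    return (1.0 - math.exp(-k1)) * (1.0 - math.exp(-n * p ** 4 * q / 8.0))

k_max, r = overlap_thresholds(n=200, p=0.8, q=0.2)
bound = detection_lower_bound(n=200, p=0.8, q=0.2)
print(f"k <= {k_max:.3f}, r = {r:.3f}, detection >= {bound:.3f}")
```

Because p enters as a fourth power, both the admissible k and the bound fall off sharply as the digestion rate drops.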

27
Overlap Rule
  • Controlling False Positives
  • Consider an arbitrary alignment. Let the random
    variable W denote the number of fragments in
    clone A that positionally match with the
    fragments of clone B.
  • $E[C(W, i)] = C(m, i)\,(b/2)^i \approx (1/i!)\,(n p\, b/2)^i$
  • By Brun's sieve
  • $\Pr[W = i] \approx (1/i!)\,(b n p/2)^i\, e^{-b n p/2}$
  • $W \sim \mathrm{Poisson}(b n p/2)$, approximately
  • and the false positive probability is
  • $4 r \sum_{i=k}^{\infty} (1/i!)\,(b n p/2)^i\, e^{-b n p/2}$
  • Make r as small and k as large as possible
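
The false-positive bound is a scaled Poisson tail, which can be summed term by term. The sketch below assumes the tail runs over i ≥ k with mean b n p/2, truncates the infinite sum after a fixed number of terms, and caps the result at 1; the sample parameters are illustrative assumptions and the function names are hypothetical.

```python
import math

def poisson_tail(mean, k, terms=200):
    """Pr[W >= k] for W ~ Poisson(mean), summing `terms` terms from i = k."""
    term = math.exp(-mean) * mean ** k / math.factorial(k)  # Pr[W = k]
    total = 0.0
    for i in range(k, k + terms):
        total += term
        term *= mean / (i + 1)  # Pr[W = i+1] from Pr[W = i]
    return total

def false_positive_bound(n, p, b, k, r):
    """4 r Pr[W >= k] with W ~ Poisson(b n p / 2), capped at 1."""
    return min(1.0, 4.0 * r * poisson_tail(b * n * p / 2.0, k))

loose = false_positive_bound(n=200, p=0.9, b=0.08, k=15, r=3.0)
tight = false_positive_bound(n=200, p=0.9, b=0.08, k=20, r=3.0)
print(f"k = 15: {loose:.4f}   k = 20: {tight:.6f}")
```

Raising k or lowering r tightens the bound, which is the trade-off against the false-negative constraint on the previous slide.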

28
Experimental Design
  • Relation among the error parameters
  • $3 b n p/4 \le k \le n p^4 q/2$
  • $\Rightarrow\ p \ge (3b/(2q))^{1/3}$
  • Parameter choice for shotgun mapping: make the
    partial digestion probability rather high (close
    to 1), or the relative sizing error low, for
    instance by using a rare cutter.
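
The design constraint can be turned into a small calculator. The sketch below assumes the reading 3bnp/4 ≤ k ≤ n p⁴ q/2, which rearranges to p ≥ (3b/(2q))^(1/3). The overlap ratio q = 0.35 is a hypothetical value chosen purely for illustration, and the two sizing errors correspond to 3 Kb and 2 Kb on a 25 Kb average fragment.

```python
def min_digestion_rate(b, q):
    """Smallest p for which the window 3bnp/4 <= k <= n p^4 q / 2 is non-empty."""
    return (3.0 * b / (2.0 * q)) ** (1.0 / 3.0)

def feasible_k_range(n, p, b, q):
    """Lower and upper limits on k for given parameters."""
    return 3.0 * b * n * p / 4.0, n * p ** 4 * q / 2.0

q_assumed = 0.35                       # hypothetical overlap threshold ratio
p_3kb = min_digestion_rate(3.0 / 25.0, q_assumed)
p_2kb = min_digestion_rate(2.0 / 25.0, q_assumed)
k_low, k_high = feasible_k_range(200, p_3kb, 3.0 / 25.0, q_assumed)
print(f"required p: {p_3kb:.3f} (3 Kb error), {p_2kb:.3f} (2 Kb error)")
```

The ratio of the two required rates, (3/2)^(1/3), does not depend on q, and at the minimum rate the window for k closes to a single value.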

29
Contour Plot as a Function of Sizing Error
(x-axis) and Digestion Rate (y-axis)
  • The calculation is for the human genome, G = 3,300
    Mb.
  • The average molecule length is 5 Mb, with an
    overlap of 1 Mb
  • The average restriction fragment length is 25 Kb
  • For a sizing error of 3 Kb, the required
    digestion rate is 80%
  • If the sizing error is reduced to 2 Kb, the
    required digestion rate drops to 70%
  • (See Mathematica Demo GentigFeasibility.nb)

30
Gentig (GENomic conTIG) Algorithm
  • Scoring Function
  • An upper bound estimate of the false positive
    overlap probability
  • A Bayesian probability estimate for the proposed
    placement
  • Maximize the Bayesian probability density subject
    to the false positive probability constraint
  • GREEDY ALGORITHM
31
Other Ongoing Projects
  • Valis: Bioinformatic Environment & Language
  • (Funded by DOE & NYSTAR)
  • Microarray-based Genomic Mapping
  • (In collaboration with CSHL; funded by NCI/NIH)
  • Expression Data Analysis
  • (In collaboration with NYU Biology funded by
    NSF)
  • Cell Informatics
  • (Funded by DARPA)

32
Valis Architecture
  • Valis aims to address all aspects of post-genomic
    biology.
  • With this goal in mind we built a powerful
    computational infrastructure
  • With a distributed architecture consisting of a
    Linux cluster and customized special hardware for
    homology search
  • A large database system for massive amounts of
    biological data in multiple forms
  • Mathematical, statistical and algorithmic tools
    that can handle the multitude of scientific
    problems arising from bioinformatics, comparative
    and functional genomics, cell informatics,
    population genomics, etc.

33
Algorithmic Support
  • Wide classes of mathematical and computational
    tools are integrated into Valis
  • Most of the work and the interesting technical
    developments are algorithmic. The tools rely on
    mathematical ideas from
  • combinatorics,
  • probabilistic methods,
  • statistics,
  • kinetic modeling,
  • and dynamical and discrete event systems.
  • For example, to construct probe maps using
    microarrays, the algorithm relies on a
    probabilistic analysis of when nearby probes get
    hybridized by a low coverage sample from a clone
    library. This analysis is built into the design
    of the microarray experiments, and also exploited
    by the data structures used in the algorithm.

34
Bio-computing
  • Joint Project involving Cold Spring Harbor Lab &
    Courant Institute
  • Algorithmic Tools and Computational Frameworks
    for Cell Informatics 2001-2004
  • Two areas of interest
  • Computational Tools
  • Valis Informatics Tools
  • Simulation Tools
  • Reasoning Tools
  • Biological Experiments
  • DNA Evolution
  • Cell Communication

35
Cocultivation Experiments
  • Cells signal through communication proteins
  • Many communication proteins fall into two
    classes
  • Extracellular factors and
  • External receptors.
  • Factor-receptor interactions occur in pairs and
    influence the genes and proteins that cells
    express.
  • Factors and receptors are encoded by genes, about
    a thousand of each class.
  • Only a few of each class are expressed in cells
    of a particular type.
  • The pairings of factors and receptors are largely
    unknown.
  • The consequences of most factor-receptor
    interactions are unknown.
  • These pairings and their consequences are
    explored by cell cocultivation experiments.
  • We examine cell types A and B alone, and when
    cocultured (A c B).
  • We examine the genes expressed by cells using DNA
    microarrays, which quantitate tens of thousands of
    genes simultaneously.

36
Experimental Results
  • Cell A is a carcinoma (derived from ectoderm),
  • Cell B is a sarcoma (derived from mesoderm),
  • The data displayed are ratios of expressed genes,
    each point the ratio of either A or B alone, or A
    and B cocultured (A c B) vs the combination of
    both A and B cultured alone and then combined
    (AB).

37
Simulation & Inference
  • A Rudimentary Simulation with Mathematica
  • Specifications of cell types-

38
Concluding Remarks
  • The central claim of this proposal is that, by
    drawing upon mathematical approaches developed in
    the context of dynamical systems, kinetic
    analysis, computational theory and logic, it is
    possible to create powerful simulation, analysis
    and reasoning tools for working biologists to be
    used in deciphering existing data, devising new
    experiments and ultimately, understanding
    functional properties of genomes, proteomes,
    cells, organs and organisms.
  • Risks & Challenges
  • Effectively integrating diverse methodologies to
    study a monolithic, heterogeneous and complex
    system
  • Multiple hierarchical levels of fidelity
  • Multiple spatio-temporal scales
  • Automatically designing experiments that can
    falsify and eventually revise existing models.
  • Milestones and Deliverables
  • Kinetic-modeling prototyping tool with VALIS
    interface
  • A quantitative sequence analysis tool
  • A genome-generator (based on a stochastic
    context-free grammar and simulation of
    genome-rearrangements)
  • A preliminary integrated tool combining
    simulation, visualization, numerical integration
    and symbolic algebraic analysis
  • A hybrid system simulator and modal logic
    reasoning system
  • A simulation and combinatorial/probabilistic
    analysis system for DNA repair, polymerase
    stuttering and unselected DNA drift (using
    kinetic models)
  • A multi-fidelity model of signal transduction
  • Experimental validation of microarray-based,
    computationally driven model of the RAS pathway