Summary of Some Ideas - PowerPoint PPT Presentation

Provided by: csN6
Learn more at: https://cs.nyu.edu

Transcript and Presenter's Notes

1
Regeneron Seminar 6.4.2004

2
BAC to the Future (mapping, CGH, genetics &
beyond)
  • Bud Mishra
  • Courant Inst., Cold Spring Harbor Lab., Tata
    Inst. of Fund. Res., Mt. Sinai School of Medicine

3
Cancer
4
A Challenge
  • At present, description of a recently diagnosed
    tumor in terms of its underlying genetic lesions
    remains a distant prospect. Nonetheless, we look
    ahead 10 or 20 years to the time when the
    diagnosis of all somatically acquired lesions
    present in a tumor cell genome will become a
    routine procedure.
  • Douglas Hanahan and Robert Weinberg
  • Cell, Vol. 100, 57-70, 7 Jan 2000

5
Amplifications & Deletions
6
Goals
  • Spontaneous Somatic Mutations
  • Common Amplifications and Homozygous Deletions in
    the Genome of Human Tumor Cells
  • Spontaneous Mutations in the Parental Germline
  • Sporadic Hereditary Diseases
  • Autism, Juvenile Schizophrenia, Childhood
    Neoplasms
  • Based on a Collection of LCR Probes,
    Representative of the Genome
  • Detailed Chromosomal Positions of the Probes are
    assumed unknown and may need to be created ab
    initio.

7
Biotechnology
Where we explore various tools of the trade
8
Tools of the Trade: SCISSORS
  • Type II Restriction Enzyme
  • Biochemicals capable of cutting the
    double-stranded DNA by breaking two -O-P-O-
    bridges on each backbone
  • Restriction Site
  • Corresponds to specific short sequences, e.g.,
    EcoRI: GAATTC
  • Naturally occurring protein in bacteria. Defends
    the bacterium from invading viral DNA. The bacterium
    produces another enzyme that methylates the
    restriction sites of its own DNA
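
The idea of a restriction site can be illustrated with a small sketch that scans a sequence for the EcoRI site GAATTC and cuts it there. The function names and the example sequence are made up for illustration; the cut position (after the first G, G^AATTC) is the standard EcoRI cut, and methylation protection is not modeled.

```python
def find_sites(sequence, site="GAATTC"):
    """Return the 0-based start position of every occurrence of `site`."""
    upper = sequence.upper()
    positions = []
    start = upper.find(site)
    while start != -1:
        positions.append(start)
        start = upper.find(site, start + 1)
    return positions


def digest(sequence, site="GAATTC", cut_offset=1):
    """Split `sequence` at each site; EcoRI cuts after the first base (G^AATTC)."""
    cuts = [p + cut_offset for p in find_sites(sequence, site)]
    bounds = [0] + cuts + [len(sequence)]
    return [sequence[a:b] for a, b in zip(bounds, bounds[1:])]


if __name__ == "__main__":
    # Hypothetical 18-base sequence with two EcoRI sites.
    print(find_sites("AAGAATTCTTGAATTCGG"))
    print(digest("AAGAATTCTTGAATTCGG"))
```

Only one strand is handled here; since GAATTC is its own reverse complement, the same positions apply to the other strand.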

9
Tools of the Trade: GLUE
  • DNA Ligase
  • Cellular enzyme: joins two strands of DNA
    molecules by repairing phosphodiester bonds
  • T4 DNA Ligase (E. coli infected with
    bacteriophage T4)
  • Hybridization
  • Hydrogen bonding between two complementary single
    stranded DNA fragments, or an RNA fragment and a
    complementary single stranded DNA fragment
    results in a double stranded DNA or a DNA-RNA
    fragment

10
Tools of the Trade: COPIER
  • DNA Amplification
  • Main Ingredients: Insert (the DNA segment to be
    amplified), Vector (a cloning vector that
    combines with an insert to create a replicon),
    Host Organism (usually bacteria).

11
Tools of the Trade: COPIER
  • PCR (Polymerase Chain Reaction)
  • Main Ingredients: Primers, Catalysts, Templates,
    and the dNTPs.

12
Karyotyping & CGH
Where we examine existing methods to characterize
the Cancer Genome.
13
Karyotyping
14
Karyotyping
15
Karyotypic Analysis
  • Not enough chromosomes: Turner's Syndrome
  • Too many chromosomes: Down's Syndrome
  • Mixed up pieces (Translocations): Philadelphia
    Chromosome
  • Missing pieces or Deletions: Cri-du-chat Syndrome
  • Other anomalies: Fragile X Syndrome
16
Ploidy Analysis
  • Compare DNA content of unknown cell population to
    DNA content of reference cell population
  • If amount of DNA differs from the reference the
    unknown sample may be aneuploid (or haploid,
    triploid, tetraploid, etc.)

17
CGH: Comparative Genomic Hybridization
  • Equal amounts of biotin-labeled tumor DNA and
    digoxigenin-labeled normal reference DNA are
    hybridized to normal metaphase chromosomes
  • The tumor DNA is visualized with fluorescein and
    the normal DNA with rhodamine
  • The signal intensities of the different
    fluorochromes are quantitated along the single
    chromosomes
  • The over- and underrepresented DNA segments are
    quantified by computation of tumor/normal ratio
    images and average ratio profiles

Amplification
Deletion
18
CGH: Comparative Genomic Hybridization
19
CGH: Comparative Genomic Hybridization
20
RDA: Representational Difference Analysis
  • PCR-based Set-Differencing
  • Two Sets $S_1$ and $S_2$
  • If $x \in S_1$ then both $x_w$ and $x_c$ have primers.
  • If $x \in S_2$ then neither $x_w$ nor $x_c$ has primers.
  • If $x \in S_1 \setminus S_2$ then $x$ undergoes an exponential
    growth.
  • If $x \in S_1 \cap S_2$ then $x$ undergoes a linear growth.
  • If $x \in S_2 \setminus S_1$ then $x$ has no growth.
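
The three growth regimes above can be written as a toy model. This is a sketch under idealized assumptions (perfect doubling per cycle, one starting copy, no reagent limits); the function name is hypothetical.

```python
def copies_after_pcr(fragment, s1, s2, cycles):
    """Idealized copy count of `fragment` after `cycles` rounds of PCR.

    s1 is the set whose fragments carry primers on both strands,
    s2 the set whose fragments carry none.
    """
    in_s1 = fragment in s1
    in_s2 = fragment in s2
    if in_s1 and not in_s2:
        return 2 ** cycles   # both strands primed: doubles every cycle
    if in_s1 and in_s2:
        return 1 + cycles    # hybrids with one primed strand: one new copy per cycle
    return 1                 # no primers: no growth


if __name__ == "__main__":
    s1, s2 = {"a", "b"}, {"b", "c"}
    for frag in "abc":
        print(frag, copies_after_pcr(frag, s1, s2, cycles=10))
```

After a handful of cycles the fragments unique to $S_1$ dominate the mixture, which is what makes the procedure act as a set difference.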

21
SAGE: Serial Analysis of Gene Expression
  • Three principles underlie the SAGE methodology
  • A short sequence tag (10-14bp) contains
    sufficient information to uniquely identify a
    transcript provided that the tag is obtained
    from a unique position within each transcript.
  • Sequence tags can be linked together to form long
    serial molecules that can be cloned and
    sequenced.
  • Quantitation of the number of times a particular
    tag is observed provides the expression level of
    the corresponding transcript.

22
Microarray Based Methods
  • RNA expression microarray analysis
  • Analysis of DNA copy number changes using CGH to
    microarrayed BACs
  • Analysis of DNA copy number changes using
    microarrayed cDNAs and ESTs

23
Analysis of copy number changes
Where we develop a novel method to find copy
number fluctuations: ROMA & array CGH
24
Microarray Analysis
  • Representations are reproducible samplings of DNA
    populations in which the resulting DNA has a new
    format and reduced complexity.
  • We array probes derived from low complexity
    representations of the normal genome
  • We measure differences in gene copy number
    between samples ratiometrically
  • Since representations have a lower nucleotide
    complexity than total genomic DNA, we obtain a
    stronger specific hybridization signal relative
    to non-specific and noise

25
Tumor vs. Normal
  • Copy number can be measured by computing the fold
    changes
  • Yellow: Copy number unchanged
  • Red: Amplification (More tumor material than
    normal)
  • Green: Deletion (Less tumor material than normal)
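
A minimal sketch of the fold-change color call described above. The 1.5-fold threshold is an assumption chosen for illustration, not a value from the slides, and the intensities are made-up numbers.

```python
import math


def classify_probe(tumor, normal, threshold=1.5):
    """Return (fold change, color) for one probe's tumor/normal intensities.

    `threshold` is a hypothetical fold-change cutoff applied symmetrically
    on the log2 scale.
    """
    fold = tumor / normal
    log2_ratio = math.log2(fold)
    cutoff = math.log2(threshold)
    if log2_ratio > cutoff:
        color = "red"      # amplification: more tumor material than normal
    elif log2_ratio < -cutoff:
        color = "green"    # deletion: less tumor material than normal
    else:
        color = "yellow"   # copy number unchanged
    return fold, color


if __name__ == "__main__":
    for tumor, normal in [(400.0, 100.0), (100.0, 100.0), (50.0, 100.0)]:
        print(tumor, normal, classify_probe(tumor, normal))
```

Working on the log2 scale keeps a 2-fold gain and a 2-fold loss at the same distance from zero.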

26
Sir Ernest Rutherford
"For Mike's sake, Soddy, don't call it
transmutation. They'll have our heads off as
alchemists." Rutherford, winner of the 1908 Nobel
Prize in Chemistry for cataloging alpha and beta
particles
  • All science is either physics or stamp
    collecting.

27
Low Complexity Representation
  • Superior Hybridization Kinetics and
    Signal-to-Noise Ratio.
  • Reproducible, Reliable & Consistent.
  • Can be prepared in large amounts from microscopic
    amounts of material.
  • Parallel representations preserve gene ratios
    between samples treated in parallel.

28
MAP (Maximum A Posteriori) Estimation Algorithm
Where we develop a novel algorithm to segment
regions of similar copy number alterations
29
How Representations are Made..
30
BglII Representation (3)
31
Copy Number Fluctuation
32
HMM
33
HMM, finally
Model with a very high degree of freedom, but not
enough data points. Small sample statistics ⇒
overfitting, convergence to local maxima, etc.
34
HMM, last time
  • Advantages
  • Small Number of parameters. Can be optimized by
    MAP estimator. (EM has difficulties).
  • Easy to model deviation from Markovian properties
    (e.g., polymorphisms, power-law, Pólya's urn-like
    process, local properties of chromosomes, etc.)

We will simply model the number of break-points
by a Poisson process, and lengths of the
aberrational segments by an exponential
process. Two-parameter model: $p_b$, $p_e$
35
A MAP (Maximum A Posteriori) Estimator
  • The prior depends on two parameters $p_e$ and $p_b$.
  • $p_e$ is the probability of a particular probe being
    normal.
  • $p_b$ is the average number of intervals per unit
    length.

Generalizes HMM
  • Priors
  • Deletion Amplification
  • Data
  • Priors Noise
  • Goal: Find the most plausible hypothesis of
    regional changes and their associated copy numbers

36
Likelihood Function
  • The µ values of non-global probes are unknown.
  • We estimate these µ values using the sample mean
    for that interval.
  • Our Bayesian solution maximizes L to yield the
    optimal segmentation

37
A dynamic programming algorithm.
  • Generalizes VITERBI
  • Extension
  • Adds a new interval to the end.
  • Likelihood function can be incrementally computed
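
The dynamic program outlined above (extend the solution by adding a new interval at the end, with the score computed incrementally) can be sketched as follows. This is not the authors' likelihood: it assumes Gaussian noise, so each interval is scored by squared error around its sample mean, and a single hypothetical `penalty` per interval stands in for the prior on break-points.

```python
def segment(values, penalty):
    """Split `values` into intervals minimizing
    (within-interval squared error) + penalty * (number of intervals).

    Returns a list of (start, end) index pairs, end exclusive.
    """
    n = len(values)
    # Prefix sums let us score any interval in O(1).
    s1, s2 = [0.0], [0.0]
    for v in values:
        s1.append(s1[-1] + v)
        s2.append(s2[-1] + v * v)

    def sse(i, j):
        """Squared error of values[i:j] around its own sample mean."""
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    best = [0.0] + [float("inf")] * n   # best[j]: optimal score of values[:j]
    back = [0] * (n + 1)                # back[j]: start of the last interval
    for j in range(1, n + 1):
        for i in range(j):              # try every interval [i, j) as the last one
            score = best[i] + sse(i, j) + penalty
            if score < best[j]:
                best[j], back[j] = score, i

    intervals, j = [], n
    while j > 0:                        # trace the optimal intervals backwards
        intervals.append((back[j], j))
        j = back[j]
    return intervals[::-1]


if __name__ == "__main__":
    ratios = [0.0] * 5 + [1.0] * 5 + [0.0] * 5   # made-up log-ratio profile
    print(segment(ratios, penalty=0.5))
```

A larger penalty plays the role of a prior that expects fewer break-points; with a very large penalty the whole profile collapses into one interval.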

38
A reasonable choice of priors yields good
segmentation.
39
A reasonable choice of priors yields good
segmentation.
40
Sir Ernest Rutherford
  • If your experiment needs statistics, you ought
    to have done a better experiment.

41
Prior Selection: F criterion
  • For each break we have a $T^2$ statistic and the
    appropriate tail probability (p value) calculated
    from the distribution of the statistic. In this
    case, this is an F distribution.
  • The best (pe,pb) is the one that leads to the
    maximum min p-value.

42
Thought Experiments, Algorithms & Simulations
Where we think about how to assign chromosomal
locations to probes using array hybridization
43
Locations of the Probes
44
Locations of the Probes
45
Mapping Representational Probes
  • Statistics of inter-probe pair-wise distance
    measurements
  • Estimating distances by hybridization with pools
    of clones from a library
  • Simulation results

46
Sir Ernest Rutherford
  • "We haven't the money, so we've got to think."

47
Measuring distances
  • A one-dimensional Buffon's needle problem.
  • Take two points on a line, and drop unit-length
    needles of some color.
  • The probability that the two points will have
    different colors monotonically increases with the
    distance between these two points
  • as distance increases from 0 to 1
  • attains a fixed value for all distances longer
    than 1.
  • One can generalize by considering
  • More than two points: P points.
  • Dropping a small set of bichromatic needles

[Figure: three points p on a line under dropped needles; Distance ≈ 3/6 = 0.5]
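
The needle-dropping thought experiment can be checked with a small Monte Carlo sketch. Assumptions not in the slides: needles of each of two colors (R, G) start along the line as a Poisson process with a hypothetical `rate` per unit length, and a point's color is B (no needle), R, G, or Y (both colors).

```python
import random


def drop_needles(lo, hi, rate, rng):
    """Start positions of a Poisson process with the given rate on [lo, hi]."""
    starts, pos = [], lo
    while True:
        pos += rng.expovariate(rate)
        if pos > hi:
            return starts
        starts.append(pos)


def color_at(point, starts_by_color, needle_length=1.0):
    """Color seen at `point`: B (blank), R, G, or Y (covered by both colors)."""
    covering = {color for color, starts in starts_by_color.items()
                if any(s <= point < s + needle_length for s in starts)}
    if not covering:
        return "B"
    if len(covering) == 2:
        return "Y"
    return next(iter(covering))


def differ_probability(distance, rate=1.0, trials=4000, seed=0):
    """Estimate Pr[two points `distance` apart show different colors]."""
    rng = random.Random(seed)
    differ = 0
    for _ in range(trials):
        # Only needles starting in [-1, distance] can cover either point.
        needles = {c: drop_needles(-1.0, distance, rate, rng) for c in ("R", "G")}
        if color_at(0.0, needles) != color_at(distance, needles):
            differ += 1
    return differ / trials


if __name__ == "__main__":
    for d in (0.0, 0.25, 0.5, 1.0, 1.5, 3.0):
        print(d, differ_probability(d))
```

The estimates rise with distance up to the needle length and then level off, which is why the color-mismatch frequency can serve as a distance measure only for probes closer than one clone length.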
48
The Experiments
  • Probes are points
  • BACs are needles
  • Hybridization on an array simulates dropping the
    bichromatic needles

[Figure: High Coverage BAC Library divided into cX coverage subsamples (labeled M)]
49
A Mathematical Problem
  • A set of P points $\{x_1, x_2, \ldots, x_P\} \subseteq [0, G]$ with
    pdf $f(x) = 1/G$, i.i.d. for all $x \in [0, G]$
  • Distance $d_{i,j} = d(x_i, x_j)$, measured between
    two arbitrary points $x_i$ and $x_j$ of the set.
  • Given $O(P^2)$ distances, infer positions.

50
Distance vs Observed
51
Matrix-to-Line
  • Given a $P \times P$ positive symmetric real-valued
    matrix D of measured distances.
  • The entry $d_{i,j} \sim f(d \mid x)$.
  • Choose an embedding of the points
  • $\{x_1, x_2, \ldots, x_P\} \subset [0, G]$,
  • which maximizes a likelihood function
  • $\prod_{1 \le i, j \le P} f(|x_i - x_j| \mid d_{i,j})$

52
Bayes Formula
53
Minimizing a Quadratic Cost Function
54
A Physical Model
[Figure: points P1, P2, P3, P4 joined pairwise by springs
labeled d1,2, d1,3, d1,4, d2,3, d2,4, d3,4]
Mass-less balls connected with springs of
different stiffness
55
Algorithm
Join
  • Consider measured distances of length less than
    $\theta L$. Examine these distances in increasing order.
  • $\theta \in (0,1)$, to be determined by the Chernoff
    bounds
  • Initially, every probe is a singleton contig.
  • Two operations, Join and Adjust, either combine
    smaller contigs or improve an existing contig.

56
Algorithm
Adjust
  • Join and Adjust locally minimize the
    log-likelihood cost function
  • Local minimum of a weighted sum-of-square error
    function
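
One way to picture an Adjust-style step is plain gradient descent on the weighted sum-of-squares error, with measured distances as spring rest lengths and weights as stiffness. This is a sketch, not the authors' procedure: the step size, iteration count, and example positions are assumptions, and like any local method it only finds a local minimum.

```python
def cost(x, d, w):
    """Weighted sum of squared errors between embedded and measured distances."""
    n = len(x)
    return sum(w[i][j] * (abs(x[i] - x[j]) - d[i][j]) ** 2
               for i in range(n) for j in range(i + 1, n))


def adjust(x, d, w, step=0.05, iterations=500):
    """Move the points downhill on `cost`; returns the adjusted positions."""
    x = list(x)
    n = len(x)
    for _ in range(iterations):
        grad = [0.0] * n
        for i in range(n):
            for j in range(n):
                if i != j:
                    sign = 1.0 if x[i] >= x[j] else -1.0
                    # Derivative of w * (|xi - xj| - d)^2 with respect to xi.
                    grad[i] += 2 * w[i][j] * (abs(x[i] - x[j]) - d[i][j]) * sign
        x = [xi - step * g for xi, g in zip(x, grad)]
    return x


if __name__ == "__main__":
    true_positions = [0.0, 1.0, 3.0, 6.0]          # made-up probe positions
    n = len(true_positions)
    d = [[abs(a - b) for b in true_positions] for a in true_positions]
    w = [[1.0] * n for _ in range(n)]              # equal spring stiffness
    start = [0.2, 0.8, 3.3, 5.7]                   # perturbed initial layout
    fitted = adjust(start, d, w)
    print(cost(start, d, w), cost(fitted, d, w))
```

Pairwise distances determine positions only up to translation and reflection, so the fitted layout should be compared to the truth through its spacings rather than its absolute coordinates.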

57
Algorithmic Complexity
58
The Experiments
  • Outcomes for probe $p_i$
  • $p_i$ hybridizes to zero BACs.
  • Outcome B (blank)
  • $p_i$ hybridizes to at least one red BAC and zero
    green BACs.
  • Outcome R (red)
  • $p_i$ hybridizes to zero red BACs and at least one
    green BAC.
  • Outcome G (green)
  • $p_i$ hybridizes to at least one red BAC and at
    least one green BAC.
  • Outcome Y (yellow)
  • We call these events $i_B$, $i_R$, $i_G$ and $i_Y$
    respectively.

59
Hamming Distance
  • The full experiment consists of M random samples.
  • The output is a color string for each probe.
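  • $s_j = \langle s_{j,k} \rangle_{k=1}^{M}$ with $s_{j,k} \in \{B, R, G, Y\}$
  • associated with probe $p_j$
  • $H_{i,j}$ = number of places where $s_i$ and $s_j$ differ
  • $C_{i,j}$ = number of places where $s_i$ and $s_j$ are the same but
    not blank
  • $T_{i,j}$ = number of places where $s_i$ and $s_j$ are both blank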
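
The three counts can be computed directly from two color strings. A minimal sketch; the function name and example strings are made up, and the third count is read as "both blank".

```python
def compare_color_strings(s_i, s_j):
    """Return (H, C, T) for two equal-length strings over {B, R, G, Y}.

    H: positions where the colors differ
    C: positions where they agree and are not blank
    T: positions where both are blank
    """
    if len(s_i) != len(s_j):
        raise ValueError("color strings must cover the same M experiments")
    h = c = t = 0
    for a, b in zip(s_i, s_j):
        if a != b:
            h += 1
        elif a == "B":
            t += 1
        else:
            c += 1
    return h, c, t


if __name__ == "__main__":
    print(compare_color_strings("RGYBB", "RGBBY"))
```

The three counts always sum to M, the number of experiments.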

60
Notations
  • $N_f$ = Clones per experiment
  • $M$ = Experiments
  • $L$ = Length of a clone
  • $G$ = Length of a genome
  • $\alpha = N_f/G$ = Pr[a clone starts at a site]
  • $c = N_f L/G = \alpha L$ = coverage per experiment
  • $\alpha' = \alpha_G = \alpha_R = \alpha/2 = c/(2L)$

61
Computing the Probabilities
  • Probability of Events
  • $C = (i_G \wedge j_G) \vee (i_R \wedge j_R) \vee (i_Y \wedge j_Y)$
  • $T = (i_B \wedge j_B)$
  • $H = \neg (C \vee T)$

62
Computing the Probabilities
63
Computing the Probabilities
  • $\Pr(C \mid x \le L) = [1 - 2\exp(-\alpha' L) + 2\exp(-\alpha'(L+x))]^2 - \exp(-2\alpha'(L+x))$
  • $\Pr(T \mid x \le L) = \exp(-2\alpha'(L+x))$
  • $\Pr(H \mid x \le L) = 1 - [1 - 2\exp(-\alpha' L) + 2\exp(-\alpha'(L+x))]^2$
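
The three probabilities can be evaluated numerically as a sanity check that they sum to one and that the mismatch probability grows with distance. A sketch assuming the rate in the exponentials is the per-color clone start rate; the function name and the example numbers are hypothetical.

```python
import math


def event_probabilities(x, clone_length, rate):
    """Return (Pr(C), Pr(T), Pr(H)) for two probes at distance x <= clone_length.

    `rate` is the per-color clone start rate (clone starts per unit length).
    """
    if not 0 <= x <= clone_length:
        raise ValueError("formulas hold only for 0 <= x <= clone_length")
    both = math.exp(-rate * (clone_length + x))
    base = 1 - 2 * math.exp(-rate * clone_length) + 2 * both
    pr_c = base ** 2 - both ** 2    # same color, not blank
    pr_t = both ** 2                # both blank
    pr_h = 1 - base ** 2            # colors differ
    return pr_c, pr_t, pr_h


if __name__ == "__main__":
    for x in (0.0, 0.25, 0.5, 1.0):
        print(x, event_probabilities(x, clone_length=1.0, rate=1.0))
```

Because Pr(H) is monotone in x on this range, an observed mismatch fraction can in principle be inverted to a distance estimate, for instance by bisection.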

64
Computing the Probabilities
65
Final Estimator
66
Chernoff Bound
  • False Positives: $(d < \theta L) \wedge (x > L)$
  • False Negatives: $(x < \theta L) \wedge (d > L)$

67
Computing the Chernoff Bounds
68
Yeast Mapping
69
Steps in Mapping
70
Data from One Experiment
71
Expectation Maximization
72
Map
73
Local Distances
74
Sequence Validation
75
Sequence Validation
76
Sequence Validation
77
Sir Ernest Rutherford
  • I have become more and more impressed by the
    power of the scientific method of extending our
    knowledge of nature.
  • Experiment, directed by the imagination of either
    an individual, or still better of a group of
    individuals of varied mental outlook is able to
    achieve results which far transcend the
    imagination alone of the greatest natural
    philosopher.

78
Sir Ernest Rutherford
  • Experiment without imagination, or imagination
    without recourse to experiment, can accomplish
    little. But for effective progress, a happy blend
    of these powers is necessary

79
  • Students
  • Fang Chen
  • Jiawu Feng
  • Ofer Gill
  • Matthias Heymann
  • Iuliana Ionita
  • Venkatesh P. Mysore
  • Marina Spivak
  • Bing Sun
  • Yi (Joey) Zhou
  • Visitors
  • Marco Isopi
  • Carla Piazza
  • Alberto Policriti
  • Naomi Silver
  • Chris Wiggins
  • Franz Winkler
  • Principal Investigator
  • Bud Mishra
  • Researchers
  • Marco Antoniotti
  • Paolo Barbano
  • Vera Cherepinsky
  • Raoul-Sam Daruwala
  • Gilad Lerman
  • Joe McQuown
  • Toto Paxia
  • Archisman Rudra
  • Nadia Ugel
  • Alumni
  • Will Casey
  • Marc Rejali

80
The End
  • http://www.cs.nyu.edu/mishra
  • http://bioinformatics.cat.nyu.edu
  • Valis, Gene Grammar, NYU MAD, Cell Simulation,

81
Other Ongoing Projects
  • SINGLE MOLECULE MAPPING
  • Single Molecule Genomics: Optical Mapping,
    Optical Sequencing, RFLP Haplotyping
  • (In collaboration with Wisc; funded by NCI)
  • ARRAY MAPPING
  • (In collaboration with MIT; funded by NSF ITR)
  • ARRAY CGH
  • Microarray-based Genome Mapping
  • (In collaboration with NYU Med School, CSHL;
    funded by NCI/NIH)
  • EXPRESSION DATA ANALYSIS
  • (In collaboration with NYU Biology, Med School;
    funded by NSF, MHHI)

82
SINGLE MOLECULE OPTICAL MAPPING
83
Error Sources
  • Sizing Error
  • (Bernoulli labeling, absorption cross-section,
    PSF)
  • Partial Digestion
  • False Optical Sites
  • Orientation
  • Spurious molecules, Optical chimerism, Calibration

Image of restriction enzyme digested YAC clone
YAC clone 6H3, derived from human chromosome 11,
digested with the restriction endonuclease Eag I
and Mlu I, stained with a fluorochrome and imaged
by fluorescence microscopy.
84
Optical Mapping: Interplay between Biology and
Computation
85
Y
  • From a gene's point of view, reshuffling is a
    great restorative
  • The Y, in its solitary state disapproves of such
    laxity. Apart from small parts near each tip
    which line up with a shared section of the X, it
    stands aloof from the great DNA swap. Its genes,
    such as they are, remain in purdah as the
    generations succeed. As a result, each Y is a
    genetic republic, insulated from the outside
    world. Like most closed societies it becomes both
    selfish and wasteful. Every lineage evolves an
    identity of its own which, quite often, collapses
    under the weight of its own inborn weaknesses.
  • Celibacy has ruined man's chromosome.
  • Steve Jones, Y: The Descent of Men, 2002.

86
Mapping the DAZ locus on Y Chromosome
87
GCP is NP-Complete
  • Transformation from Hamiltonian Path Problem
    restricted to cubic graphs.

Choose p 3/4 k M
88
NP-Completeness
  • G has a Hamiltonian path
  • $v_1, v_2, \ldots, v_M$
  • Then, the admissible placement is
  • $D_1, D_2, \ldots, D_M$
  • with at most two intervals $I_j$, $I_{j+1}$
  • overlapping with k cuts in common.
  • Conversely, any admissible
  • placement with a goodness > k induces a
    permutation $\pi$ on the indices of the vertices of
    G.
  • $v_{\pi(1)}, v_{\pi(2)}, \ldots, v_{\pi(M)}$: Hamiltonian

[Figure: vertices v1, v2, v3 and data D1, D2, D3 aligned to a Consensus Map]
89
Experiment Design
  • Relation among the error parameters
  • $3\beta n p/4 \le k \le n p^4 \theta/2$
  • $\Rightarrow p \ge (3\beta/(2\theta))^{1/3}$
  • Parameter choice for shotgun-mapping. Make the
    partial digestion probability rather high (close
    to 1) or the relative sizing error low, for
    instance by using a rare cutter.

90
Contour Plot as a Function of Sizing Error
(x-axis) and Digestion Rate (y-axis)
  • The calculation is for the human genome, G = 3,300
    Mb.
  • The average molecule length is 5 Mb, with an
    overlap of 1 Mb
  • The average restriction fragment length is 25 kb
  • For a sizing error of 3 kb, the required
    digestion rate is 94%
  • If the sizing error is reduced to 2 kb, the
    required digestion rate drops to 88%
  • If the sizing error is reduced to 1 kb, the
    required digestion rate drops to 80%
  • (See Mathematica Demo)

91
Gentig's Successes
  • E. coli
  • P. falciparum, D. radiodurans, Y. pestis
  • Rhodobacter sphaeroides, Shigella flexneri,
    Salmonella enterica
  • Aspergillus fumigatus
  • The automated Gentig system is routinely used
    to map microbe genomes quickly and effortlessly
    by scientists with no quantitative or
    computational training.

Shotgun Optical Mapping of Genomes
92
VALIS: vast active living intelligence
system
93
(No Transcript)
94
Key Features of Valis
  • State of the art of rapid prototyping in
    bioinformatics, functional genomics and systems
    biology.
  • Multilanguage Scripting
  • Data storage
  • Graphical User interfaces

95
Visual Genome
96
Multi-Scripting
  • A Valis script can be written in any supported
    language
  • JScript, VBScript, Python, PERL, Lisp, R and
    SETL.
  • All the scripts see the same Valis class
    hierarchy.
  • For example, once a user learns that a Valis
    Sequence Object has a method called Input that
    will read the sequence from a file, the user can
    subsequently use this same primitive from all the
    different languages.

97
Advantages
  • We can take the best from each language
  • Graph algorithms in SETL
  • Sockets in Python
  • Regular Expressions in Perl
  • AI in Lisp
  • Statistics in R
  • ..

98
Data Storage
  • Based on Extended B-Trees
  • At the lowest level there is a heap of pages
  • Must correctly keep track of the reference counts
    of each record/object to implement value semantics

99
B+ Tree Indexes
  • Leaf pages contain data entries, and are chained
    (prev & next)
  • Non-leaf pages contain index entries and direct
    searches

Non-leaf Pages
Leaf Pages
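
The division of labor between the two kinds of pages can be sketched with a tiny two-level tree: one index page directs the search, and chained leaf pages are followed for range scans. This is an illustration only, not the Valis storage code; the class and function names are hypothetical, and insertion, splitting, and deeper trees are left out.

```python
from bisect import bisect_right


class LeafPage:
    """Holds sorted data entries and links to neighboring leaves."""
    def __init__(self, entries):
        self.entries = list(entries)
        self.prev = None
        self.next = None


class IndexPage:
    """Holds separator keys; children[i] covers keys below keys[i]."""
    def __init__(self, keys, children):
        self.keys = keys
        self.children = children


def build_two_level_tree(sorted_keys, page_size):
    """Pack a non-empty sorted list into chained leaves under one index page."""
    leaves = [LeafPage(sorted_keys[i:i + page_size])
              for i in range(0, len(sorted_keys), page_size)]
    for left, right in zip(leaves, leaves[1:]):
        left.next, right.prev = right, left
    separators = [leaf.entries[0] for leaf in leaves[1:]]
    return IndexPage(separators, leaves)


def find_leaf(root, key):
    """Follow index entries down to the leaf that would contain `key`."""
    node = root
    while isinstance(node, IndexPage):
        node = node.children[bisect_right(node.keys, key)]
    return node


def range_scan(root, low, high):
    """Collect all entries in [low, high] by walking the leaf chain."""
    leaf, found = find_leaf(root, low), []
    while leaf is not None:
        for entry in leaf.entries:
            if entry > high:
                return found
            if entry >= low:
                found.append(entry)
        leaf = leaf.next
    return found


if __name__ == "__main__":
    tree = build_two_level_tree(list(range(0, 20, 2)), page_size=3)
    print(range_scan(tree, 5, 13))
```

The index is touched once per query; everything after that is a sequential walk along the leaf chain, which is what makes range queries cheap.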
100
Hardware
  • Although Valis is designed to be used on
    workstations, we can run the
    computation-intensive processes on Beowulf
    computing servers. This cluster has
  • 16 compute nodes connected via a Gigabit Ethernet
    and a low-latency, high speed network from
    Dolphinics.
  • Cluster nodes are dual 2.4GHz Intel Xeon
    processors with 4GB of memory.
  • It is arranged in a 3D torus topology allowing
    each node to communicate directly with its three
    nearest neighbors.
  • The disk storage capacity for hosting our
    databases is about one Terabyte.
  • The operating system is Linux and we use high
    performance MPI libraries from Scali.

101
Visualization
  • Once the processing is completed, it is very
    important to be able to quickly visualize the
    results.
  • For this reason Valis provides numerous
    visualization tools that allow a user to quickly
    display
  • sequences, maps,
  • microarray data,
  • tables,
  • graphs and annotations.
  • These widgets can be customized from the scripts.

102
Valis Demo
  • Bioinformatics 1.1, 1.2, 1.3
  • Systems Biology Simple Pathway 2.1, 2.2, 2.3
  • Systems Biology Apoptosis 2.4
  • Simpathica
  • XS-System
  • BioWave
  • BioSim
  • NYUMAD

103
SIMPATHICA: Systems Biology
How much of reasoning about biology can be
automated?
104
Why do we need a tool?
We claim that, by drawing upon mathematical
approaches developed in the context of dynamical
systems, kinetic analysis, computational theory
and logic, it is possible to create powerful
simulation, analysis and reasoning tools for
working biologists to be used in deciphering
existing data, devising new experiments and
ultimately, understanding functional properties
of genomes, proteomes, cells, organs and
organisms.
Simulate Biologists! Not Biology!!
105
Reasoning and Experimentation
106
Simpathica is a modular system
Canonical Form
  • Characteristics
  • Predefined Modular Structure
  • Automated Translation from Graphical to
    Mathematical Model
  • Scalability

107
Glycolysis
[Figure: pathway diagram with nodes Glycogen, P_i, Glucose-1-P, Glucose,
Glucose-6-P, Fructose-6-P and enzymes Phosphorylase a, Phosphoglucomutase,
Glucokinase, Phosphoglucose isomerase, Phosphofructokinase]
108
Reaction Scheme for Wnt Signaling.
The reaction steps of the Wnt pathway are
numbered 1 to 19. Protein complexes are denoted
by the names of their components. Phosphorylated
components are marked by an asterisk.
Single-headed solid arrows characterize
irreversible reactions. Double-headed arrows
denote binding equilibria. Blue arrows mark
reactions that have only been taken into account
when studying the effect of high axin
concentrations.
109
Broken arrows represent activation of Dsh by the
Wnt ligand (step 1), Dsh-mediated initiation of
the release of GSK3β from the destruction complex
(step 3), and APC-mediated degradation of axin
(step 15). The broken arrows indicate that the
components mediate but do not participate
stoichiometrically in the reaction scheme. The
irreversible reactions 2, 4, 5, 9-11, and 13 are
unimolecular, and reactions 6, 7, 8, 16, and 17
are reversible binding steps.
110
Steady State Concentration
111
β-catenin degradation
112
Wnt Demo
  • Systems Biology Wnt Pathway 3.1, 3.2
  • SimpathicaA

113
The Cell Cycle
[Figure: cell cycle diagram with phases G1, S, G2, M (metaphase), M (anaphase);
transitions start, finish, cell division; components Cdk, Cyclin, APC]
114
Cyclin B/Cdk and Cdh1/APC
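  • $\frac{d[\mathrm{CycB}]}{dt} = k_1 - (k_2' + k_2''[\mathrm{Cdh1}])\,[\mathrm{CycB}]$
  • $\frac{d[\mathrm{Cdh1}]}{dt} = \frac{(k_3' + k_3'' A)(1 - [\mathrm{Cdh1}])}{J_3 + 1 - [\mathrm{Cdh1}]} - \frac{k_4\, m\, [\mathrm{CycB}][\mathrm{Cdh1}]}{J_4 + [\mathrm{Cdh1}]}$
  • A pair of nonlinear ODEs (ordinary differential
    equations) describing the biochemical reactions
    at the center.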
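
The pair of equations can be integrated numerically to see its switch-like behavior. A sketch only: the rate constants below are illustrative assumptions (of the kind used in published versions of this two-variable model), A is fixed at 0, and a hand-written fixed-step fourth-order Runge-Kutta integrator is used to keep the example dependency-free.

```python
def derivatives(cycb, cdh1, m, a=0.0,
                k1=0.04, k2p=0.04, k2pp=1.0,
                k3p=1.0, k3pp=10.0, k4=35.0, j3=0.04, j4=0.04):
    """Right-hand sides of the CycB/Cdh1 equations (all constants assumed)."""
    d_cycb = k1 - (k2p + k2pp * cdh1) * cycb
    d_cdh1 = ((k3p + k3pp * a) * (1 - cdh1) / (j3 + 1 - cdh1)
              - k4 * m * cycb * cdh1 / (j4 + cdh1))
    return d_cycb, d_cdh1


def simulate(m, cycb=0.04, cdh1=0.9, t_end=100.0, dt=0.001):
    """Integrate from the given initial state with fixed-step RK4."""
    for _ in range(int(round(t_end / dt))):
        a1, b1 = derivatives(cycb, cdh1, m)
        a2, b2 = derivatives(cycb + 0.5 * dt * a1, cdh1 + 0.5 * dt * b1, m)
        a3, b3 = derivatives(cycb + 0.5 * dt * a2, cdh1 + 0.5 * dt * b2, m)
        a4, b4 = derivatives(cycb + dt * a3, cdh1 + dt * b3, m)
        cycb += dt * (a1 + 2 * a2 + 2 * a3 + a4) / 6
        cdh1 += dt * (b1 + 2 * b2 + 2 * b3 + b4) / 6
    return cycb, cdh1


if __name__ == "__main__":
    for mass in (0.3, 1.0):
        print("m =", mass, simulate(mass))
```

With these assumed constants, a small mass m leaves the system in a state with high Cdh1 and low CycB, while a large m flips it to low Cdh1 and high CycB, which is the antagonism the slide title refers to.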

115
Simulation of Yeast Cell Cycle
116
Simulation of Yeast Cell Cycle
117
Simulation of Yeast Cell Cycle
118
The Natural Language Interface
119
Story generation
  • Temporal Logic formulae can be rendered in
    English.
  • Temporal Logic formulae can be generated
    automatically (with care).
  • Each formula can be tested against a set of
    datasets; differences can then be noted.

120
Cell Cycle Story Generation Results (HTML
rendering)
Report on "Test Experiment Tyson WT, 1 Mutant, 2
Mutants."
RESULTS
The results refer to the following datasets: The
first dataset is named "Ian's Experiment/Tyson
Yeast Dataset WT". The second dataset is named
"Ian's Experiment/Tyson Yeast Dataset Mut1". The
third dataset is named "Ian's Experiment/Tyson
Yeast Dataset mut2".
  • CDH1 less than or equal to 1.0071783 will
    always hold until CDH1 activates CYCB, is true
    in the first dataset, is true in the second
    dataset, and is false in the third dataset.
  • CDH1 represses CYCB implies CYCB is greater than
    or equal to 0.65, is false in the first
    dataset, is true in the second dataset, and is
    true in the third dataset.
  • eventually, CDH1 is less than or equal to CYCB,
    is false in the first dataset, is true in the
    second dataset, and is true in the third
    dataset.
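
Statements like the ones above can be checked mechanically against a time series. The sketch below evaluates three temporal operators over a finite trace and reports the verdict per dataset; it is not Simpathica's implementation, the traces are made-up numbers, and "until" is read in its strong form (the second condition must eventually hold).

```python
def eventually(trace, p):
    """True if `p` holds in at least one state of the trace."""
    return any(p(state) for state in trace)


def always(trace, p):
    """True if `p` holds in every state of the trace."""
    return all(p(state) for state in trace)


def until(trace, p, q):
    """True if `q` holds at some state and `p` holds at every earlier state."""
    for state in trace:
        if q(state):
            return True
        if not p(state):
            return False
    return False


def report(formula, datasets):
    """Evaluate one formula on each named dataset."""
    return {name: formula(trace) for name, trace in datasets.items()}


if __name__ == "__main__":
    # Hypothetical traces: each state maps species names to concentrations.
    datasets = {
        "WT": [{"CDH1": 1.0, "CYCB": 0.1}, {"CDH1": 0.9, "CYCB": 0.2}],
        "Mut1": [{"CDH1": 1.0, "CYCB": 0.1}, {"CDH1": 0.2, "CYCB": 0.9}],
    }
    formula = lambda tr: eventually(tr, lambda s: s["CDH1"] <= s["CYCB"])
    print(report(formula, datasets))
```

Comparing the per-dataset verdicts is what produces the "true in the first dataset, false in the third" style of sentence in the report.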

121
Genomics: Large Segmental Duplications
122
Recent Segmental Duplications
Human
  • 3.5-5% of the human genome is found to contain
    segmental duplications, with length > 5 or 1 kb,
    identity > 90%.
  • August, 2001 assembly,
  • Bailey, et al. 2002.
  • April, 2003 assembly,
  • Cheung, et al. 2003.
  • These duplications are estimated to have emerged
    about 40 Mya under neutral assumption.
  • The duplications are mostly interspersed
    (non-tandem), and happen both inter- and
    intra-chromosomally.

From Bailey, et al. 2002
123
Recent Segmental Duplications
Mouse
  • 1.2% of the mouse genome is found to contain
    segmental duplications, with length > 5 kb,
    identity > 90%.
  • February, 2003 mouse assembly,
  • Cheung, et al. 2003.
  • These duplications are estimated to have emerged
    about 25 Mya under neutral assumption.
  • The duplications happen both inter- and
    intra-chromosomally.

From Cheung, et al. 2003
124
Statistical Analysis: Duplication Flanking
Sequences
  • What are the molecular mechanisms that caused the
    recent segmental duplications in the human and
    mouse genomes?
  • Thermodynamic instability in the DNA sequences
  • Recombination between homologous repeat elements
  • Other unknown mechanisms.

125
Thermodynamics
126
Hypotheses
Scenario 1: Repeat-Mediated Homologous
Recombination
Scenario 2: Preferential Repeat Insertion after
Duplication
Scenario 3: Artificial Boundary Effect from
Duplication Mapping
[Figure: three panels, each showing a duplicated segment, illustrating
overrepresentation of repeats in the flanking regions]
127
The Model
128
The Mathematical Model
$h_1$: proportion of duplications by repeat recombination
$h_1^{++}$: proportion of duplications by recombination of the specific repeat
$h_1^{--}$: proportion of duplications by recombination of other repeats
$h_0$: proportion of duplications by other repeat-unrelated mechanism
$h_0^{++}$: proportion of $h_0$ with common specific repeat in the flanking regions
$h_0^{+-}$: proportion of $h_0$ with no common specific repeat in the flanking regions
$h_0^{--}$: proportion of $h_0$ with no specific repeat in the flanking regions
$\alpha$: mutation rate in duplicated sequences
$\beta$: insertion rate of the specific repeat
$\gamma$: mutation rate in the specific repeat
$\delta$: divergence level of duplications
$\epsilon$: divergence interval of duplications.
129
Model Validation
[Figure: two panels, Alu and L1, plotting $f^{--}$, $f^{+-}$ and $f^{++}$ against Diversity]
  • The model parameters ($\alpha_{Alu}$, $\beta_{Alu}$, $\gamma_{Alu}$, $\alpha_{L1}$, $\beta_{L1}$,
    $\gamma_{L1}$) are estimated from the reported mutation and
    insertion rates in the literature.
  • The relative strengths of the alternative
    hypotheses can be estimated by model fitting to
    the real data.
  • $h_1^{Alu} = 0.76$, $h_1^{++,Alu} = 0.3$; $h_1^{L1} = 0.76$, $h_1^{++,L1} =
    0.35$.

130
ChIP-Chip Analysis
131
Details (math)
  • Idea in a nutshell (assume symmetric data)
  • Throw out genes which deviate significantly at
    various scales
  • Stop at exhausted scales (cubes)
  • Threshold in stopping cubes by estimating
    Cs(yx)
  • Average over shifted grids

132
Procedure of Algorithm
  • Recall Fixed dyadic grid along L
  • Compute FQ (also fQ, ßQ,s Q) in top-down alg
  • Stop at an interval if either

133
Normalization.
  • Recall setting (simplified): Data = N × 2 matrix of
    log. EVs
  • Problem: systematic variation. Different EVs
    are recorded for the same amount of mRNA.
  • Normalization: Removal of variation to allow
    balanced comparisons

134
Normalization (continues)
  • Related stat. terminology: conditional mean
    estimation previous terminology
  • Related math problem: Construct a graph (or
    chord-arc curve) & a strip around it
  • Approach to solve math problem: Combine ideas of
    multiscale curve/graph constructions (Jones,
    David and Semmes, L) with the multistrip
    construction before

135
ChIP-Chip Experiments I
136
ChIP-Chip Experiments II
137
ChIP-Chip Experiments III
138
CARTwheels: Redescription
139
What is redescription?
  • Shift of vocabulary from one language (descriptor
    family) to another to describe the same entity
  • Descriptor is any meaningful way of defining a
    subset within a universal set of entities
  • Set theoretic operations used on basic
    descriptors to define derived descriptors
  • Evaluated on the basis of Jaccard's coefficient:
    $(A \Leftrightarrow B) = |A \cap B| / |A \cup B|$
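
A minimal sketch of scoring a redescription with Jaccard's coefficient, treating each descriptor as the set of entities it selects. The gene names and descriptors are made up, and returning 1.0 for two empty descriptors is a convention chosen here.

```python
def jaccard(a, b):
    """Jaccard's coefficient |A ∩ B| / |A ∪ B| of two descriptors (sets)."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0   # assumed convention: two empty descriptors match exactly
    return len(a & b) / len(union)


if __name__ == "__main__":
    # Hypothetical descriptors over a universe of genes.
    heat_induced = {"g1", "g2", "g3"}
    in_go_category = {"g2", "g3", "g4"}
    osmotic_induced = {"g3", "g5"}
    # Derived descriptor built with a set-theoretic operation.
    heat_not_osmotic = heat_induced - osmotic_induced
    print(jaccard(heat_induced, in_go_category))
    print(jaccard(heat_not_osmotic, in_go_category))
```

A coefficient of 1 means the two descriptors pick out exactly the same entities; values below 1 correspond to the inexact redescriptions mentioned on the following slide.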

140
Why Redescribe?
  • Allows feature construction
  • Can handle any kind of data in terms of
    descriptors no data specific mining required
  • Can find commonalities and differences between
    various descriptors/descriptor families at the
    same time
  • Can look for stories using a series of inexact
    redescriptions

141
CARTwheels algorithm for redescription
142
CARTwheels algorithm for redescription
143
Implementation details
  • Simplified version of CARTwheels algorithm used
    to speed up the process
  • Algorithm implemented in C on UNIX
  • Visualization implemented in Java on UNIX
  • Interacts with Postgres based database to extract
    data/descriptors

144
Implementation details: descriptors used
  • Experimental (microarray) data for yeast from
    Gasch et al. Descriptors constructed of the form
    >, <
  • 9 different stresses used from Gasch et al. data
  • GO category assignments for genes (biological
    process, cellular component, molecular function)

145
Design of System
146
(No Transcript)
147
(No Transcript)
148
(No Transcript)
149
(No Transcript)