Title: Summary of Some Ideas
1Regeneron Seminar 6.4.2004
2BAC to the Future (mapping, CGH, genetics & beyond)
- Bud Mishra
- Courant Inst., Cold Spring Harbor Lab., Tata Inst. of Fund. Res., Mt. Sinai School of Medicine
3Cancer
4A Challenge
- At present, description of a recently diagnosed tumor in terms of its underlying genetic lesions remains a distant prospect. Nonetheless, we look ahead 10 or 20 years to the time when the diagnosis of all somatically acquired lesions present in a tumor cell genome will become a routine procedure.
- Douglas Hanahan and Robert Weinberg
- Cell, Vol. 100, 57-70, 7 Jan 2000
5Amplifications & Deletions
6Goals
- Spontaneous Somatic Mutations
- Common Amplifications and Homozygous Deletions in the Genome of Human Tumor Cells
- Spontaneous Mutations in the Parental Germline
- Sporadic Hereditary Diseases
- Autism, Juvenile Schizophrenia, Childhood Neoplasms
- Based on a Collection of LCR Probes, Representative of the Genome
- Detailed Chromosomal Positions of the Probes are assumed unknown and may need to be created ab initio.
7Biotechnology
Where we explore various tools of the trade
8Tools of the Trade: SCISSORS
- Type II Restriction Enzyme
- Biochemicals capable of cutting double-stranded DNA by breaking two -O-P-O- bridges, one on each backbone
- Restriction Site
- Corresponds to a specific short sequence, e.g., EcoRI: GAATTC
- Naturally occurring protein in bacteria
- Defends the bacterium from invading viral DNA
- Bacterium produces another enzyme that methylates the restriction sites of its own DNA
9Tools of the Trade: GLUE
- DNA Ligase
- Cellular enzyme that joins two strands of DNA molecules by repairing phosphodiester bonds
- T4 DNA Ligase (E. coli infected with bacteriophage T4)
- Hybridization
- Hydrogen bonding between two complementary single-stranded DNA fragments, or between an RNA fragment and a complementary single-stranded DNA fragment, results in a double-stranded DNA or a DNA-RNA fragment
10Tools of the Trade: COPIER
- DNA Amplification
- Main Ingredients: Insert (the DNA segment to be amplified), Vector (a cloning vector that combines with an insert to create a replicon), Host Organism (usually bacteria).
11Tools of the Trade: COPIER
- PCR (Polymerase Chain Reaction)
- Main Ingredients: Primers, Catalysts, Templates, and the dNTPs.
12Karyotyping & CGH
Where we examine existing methods to characterize
the Cancer Genome.
13Karyotyping
14Karyotyping
15Karyotypic Analysis
- Not enough chromosomes: Turner's Syndrome
- Too many chromosomes: Down's Syndrome
- Mixed-up pieces (Translocations): Philadelphia Chromosome
- Missing pieces or Deletions: Cri-du-chat Syndrome
- Other anomalies: Fragile X Syndrome
16Ploidy Analysis
- Compare DNA content of unknown cell population to DNA content of reference cell population
- If the amount of DNA differs from the reference, the unknown sample may be aneuploid (or haploid, triploid, tetraploid, etc.)
17CGH: Comparative Genomic Hybridization
- Equal amounts of biotin-labeled tumor DNA and digoxigenin-labeled normal reference DNA are hybridized to normal metaphase chromosomes
- The tumor DNA is visualized with fluorescein and the normal DNA with rhodamine
- The signal intensities of the different fluorochromes are quantitated along the single chromosomes
- The over- and underrepresented DNA segments are quantified by computation of tumor/normal ratio images and average ratio profiles
[Figure labels: Amplification; Deletion]
18CGH Comparative Genomic Hybridization.
19CGH Comparative Genomic Hybridization.
20RDA: Representational Differential Analysis
- PCR-based Set-Differencing
- Two Sets $S_1$ and $S_2$
- If $x \in S_1$ then both $x_w$ and $x_c$ have primers.
- If $x \in S_2$ then neither $x_w$ nor $x_c$ has primers.
- If $x \in S_1 \setminus S_2$ then $x$ undergoes exponential growth.
- If $x \in S_1 \cap S_2$ then $x$ undergoes linear growth.
- If $x \in S_2 \setminus S_1$ then $x$ has no growth.
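The three growth regimes of the set-differencing rule can be illustrated with a small idealized calculation. This is only a sketch under assumptions not stated on the slide: perfect PCR efficiency, one starting copy per fragment, and a hypothetical helper name `rda_copies`.

```python
def rda_copies(cycles, in_s1, in_s2):
    """Idealized copy count of one fragment after `cycles` PCR rounds.

    Assumes perfect efficiency and a single starting copy (illustrative only).
    """
    if in_s1 and not in_s2:
        # Both strands carry primers: the fragment doubles every cycle.
        return 2 ** cycles
    if in_s1 and in_s2:
        # Only one strand of each hybrid is primed: one new copy per cycle.
        return 1 + cycles
    # Fragment present only in S2: no primers, no amplification.
    return 1


for label, s1, s2 in [("S1 only", True, False),
                      ("S1 and S2", True, True),
                      ("S2 only", False, True)]:
    print(label, rda_copies(10, s1, s2))
```

After a few cycles the fragments unique to the first set dominate the pool, which is the point of the differencing step.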
21SAGE: Serial Analysis of Gene Expression
- Three principles underlie the SAGE methodology
- A short sequence tag (10-14 bp) contains sufficient information to uniquely identify a transcript, provided that the tag is obtained from a unique position within each transcript.
- Sequence tags can be linked together to form long serial molecules that can be cloned and sequenced.
- Quantitation of the number of times a particular tag is observed provides the expression level of the corresponding transcript.
22Microarray Based Methods
- RNA expression microarray analysis
- Analysis of DNA copy number changes using CGH to microarrayed BACs
- Analysis of DNA copy number changes using microarrayed cDNAs and ESTs
23Analysis of copy number changes
Where we develop a novel method to find copy number fluctuations: ROMA & array CGH
24Microarray Analysis
- Representations are reproducible samplings of DNA populations in which the resulting DNA has a new format and reduced complexity.
- We array probes derived from low-complexity representations of the normal genome.
- We measure differences in gene copy number between samples ratiometrically.
- Since representations have a lower nucleotide complexity than total genomic DNA, we obtain a stronger specific hybridization signal relative to non-specific hybridization and noise.
25Tumor vs. Normal
- Copy number can be measured by computing the fold changes
- Yellow: Copy number unchanged
- Red: Amplification (more tumor material than normal)
- Green: Deletion (less tumor material than normal)
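The fold-change reading of a two-color spot can be sketched as below. The function name `classify_spot` and the log2 threshold of 0.3 are assumptions made for illustration; the slides do not specify a cutoff.

```python
import math


def classify_spot(tumor, normal, threshold=0.3):
    """Return (log2 fold change, color call) for one probe.

    `threshold` is a hypothetical cutoff on |log2 ratio|, not a value
    taken from the talk.
    """
    log_ratio = math.log2(tumor / normal)
    if log_ratio > threshold:
        color = "red"      # more tumor material: amplification
    elif log_ratio < -threshold:
        color = "green"    # less tumor material: deletion
    else:
        color = "yellow"   # copy number unchanged
    return log_ratio, color


print(classify_spot(400.0, 100.0))
print(classify_spot(100.0, 100.0))
print(classify_spot(50.0, 100.0))
```

Working on the log scale makes a two-fold gain and a two-fold loss symmetric around zero.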
26Sir Ernest Rutherford
For Mike's sake, Soddy, don't call it transmutation. They'll have our heads off as alchemists.
Rutherford, winner of the 1908 Nobel Prize for chemistry for cataloging alpha and beta particles
- All science is either physics or stamp collecting.
27Low Complexity Representation
- Superior Hybridization Kinetics and Signal-to-Noise Ratio.
- Reproducible, Reliable & Consistent.
- Can be prepared in large amounts from microscopic amounts of material.
- Parallel representations preserve gene ratios between samples treated in parallel.
28MAP (Maximum A Posteriori) Estimation Algorithm
Where we develop a novel algorithm to segment
regions of similar copy number alterations
29How Representations are Made..
30BglII Representation (3)
31Copy Number Fluctuation
32HMM
33HMM, finally
Model with a very high degree of freedom, but not enough data points. Small-sample statistics ⇒ overfitting, convergence to local maxima, etc.
34HMM, last time
- Advantages
- Small number of parameters. Can be optimized by a MAP estimator. (EM has difficulties.)
- Easy to model deviation from Markovian properties (e.g., polymorphisms, power-law, Pólya's urn-like process, local properties of chromosomes, etc.)
We will simply model the number of break-points by a Poisson process, and the lengths of the aberrational segments by an exponential process. Two-parameter model: $p_b$, $p_e$
35A MAP (Maximum A Posteriori) Estimator
- The prior depends on two parameters $p_e$ and $p_b$.
- $p_e$ is the probability of a particular probe being normal.
- $p_b$ is the average number of intervals per unit length.
Generalizes HMM
- Priors
- Deletion Amplification
- Data
- Priors Noise
- Goal: Find the most plausible hypothesis of regional changes and their associated copy numbers
36Likelihood Function
- The µ values of non-global probes are unknown.
- We estimate these µ values using the sample mean for that interval.
- Our Bayesian solution maximizes L to yield the optimal segmentation
37A dynamic programming algorithm.
- Generalizes VITERBI
- Extension
- Adds a new interval to the end.
- Likelihood function can be incrementally computed
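The extension step (add a new interval at the end, update the score incrementally) can be sketched with a generic penalized-likelihood segmentation. This is a simplified stand-in, not the talk's likelihood: it assumes Gaussian noise with a known `sigma`, uses the segment sample mean as the level estimate, and replaces the two-parameter prior with a fixed per-segment cost `break_penalty`. All names are hypothetical.

```python
import math


def map_segment(values, break_penalty, sigma=1.0):
    """Segment `values` by dynamic programming.

    Minimizes sum over segments of SSE / (2 sigma^2) + break_penalty,
    i.e. a negative log-posterior under the simplifying assumptions above.
    Returns a list of (start, end, mean) with end exclusive.
    """
    n = len(values)
    # Prefix sums give the squared error of any segment in O(1).
    s1, s2 = [0.0], [0.0]
    for v in values:
        s1.append(s1[-1] + v)
        s2.append(s2[-1] + v * v)

    def sse(i, j):
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    best = [0.0] + [math.inf] * n   # best[j]: optimal cost of values[:j]
    back = [0] * (n + 1)            # back[j]: start of the last segment
    for j in range(1, n + 1):
        for i in range(j):
            # Extend the best segmentation of values[:i] by interval [i, j).
            cost = best[i] + sse(i, j) / (2.0 * sigma ** 2) + break_penalty
            if cost < best[j]:
                best[j], back[j] = cost, i

    segments, j = [], n
    while j > 0:
        i = back[j]
        segments.append((i, j, (s1[j] - s1[i]) / (j - i)))
        j = i
    return segments[::-1]


ratios = [0.0, 0.1, -0.1, 0.0, 0.0, 1.0, 1.1, 0.9, 1.0,
          0.0, 0.0, 0.1, -0.1, 0.0]
for start, end, mean in map_segment(ratios, break_penalty=3.0, sigma=0.2):
    print(start, end, round(mean, 3))
```

A larger `break_penalty` plays the role of a prior that expects fewer breakpoints, so it merges small fluctuations into their neighbors.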
38A reasonable choice of priors yields good
segmentation.
39A reasonable choice of priors yields good
segmentation.
40Sir Ernest Rutherford
- If your experiment needs statistics, you ought
to have done a better experiment.
41Prior Selection: F criterion
- For each break we have a $T^2$ statistic and the appropriate tail probability (p-value) calculated from the distribution of the statistic. In this case, this is an F distribution.
- The best $(p_e, p_b)$ is the one that leads to the maximum min p-value.
42Thought Experiments, Algorithms & Simulations
Where we think about how to assign chromosomal
locations to probes using array hybridization
43Locations of the Probes
44Locations of the Probes
45Mapping Representational Probes
- Statistics of inter-probe pair-wise distance measurements
- Estimating distances by hybridization with pools of clones from a library
- Simulation results
46Sir Ernest Rutherford
- "We haven't the money, so we've got to think."
47Measuring distances
- A one-dimensional Buffon's needle problem.
- Take two points on a line, and drop unit-length needles of some color.
- The probability that the two points will have different colors monotonically increases with the distance between these two points
- as distance increases from 0 to 1
- attains a fixed value for all distances longer than 1.
- One can generalize by considering
- More than two points: P points.
- Dropping a small set of bichromatic needles
[Figure: points p on a line; Distance ≈ 3/6 = 0.5]
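The monotone-then-flat behavior can be checked with a single-color simplification. The sketch below assumes that needle left ends form a Poisson process of intensity `rate` and that a point's "color" is simply covered versus uncovered; under those assumptions the probability that two points at distance x differ is $2e^{-rL}(1 - e^{-r\min(x, L)})$ for needle length L. The function names and the Poisson assumption are illustrative and not taken from the slides.

```python
import math
import random


def p_differ_exact(x, rate, length=1.0):
    """Closed form for P(exactly one of the two points is covered)."""
    d = min(x, length)
    return 2.0 * math.exp(-rate * length) * (1.0 - math.exp(-rate * d))


def p_differ_simulated(x, rate, length=1.0, trials=20000, seed=0):
    """Monte Carlo estimate with needles dropped as a Poisson process."""
    rng = random.Random(seed)
    lo, hi = -length, x          # only starts in [lo, hi] can reach 0 or x
    differ = 0
    for _ in range(trials):
        covered_0 = covered_x = False
        pos = lo + rng.expovariate(rate)   # exponential gaps between starts
        while pos < hi:
            if pos <= 0.0 <= pos + length:
                covered_0 = True
            if pos <= x <= pos + length:
                covered_x = True
            pos += rng.expovariate(rate)
        differ += covered_0 != covered_x
    return differ / trials


for x in (0.2, 0.5, 1.0, 1.5):
    print(x, p_differ_exact(x, 1.0), p_differ_simulated(x, 1.0))
```

Inverting the increasing part of this curve is what turns an observed disagreement frequency into a distance estimate, and the plateau shows why distances beyond one needle length cannot be resolved.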
48The Experiments
- Probes are points
- BACs are needles
- Hybridization on an array simulates dropping the bichromatic needles
[Figure labels: High Coverage BAC Library; M; cX coverage subsample]
49A Mathematical Problem
- A set of $P$ points $\{x_1, x_2, \ldots, x_P\} \subseteq [0, G]$, i.i.d. with pdf $f(x) = 1/G$ for all $x \in [0, G]$
- Distance $d_{i,j} = d(|x_i - x_j|)$, measured between two arbitrary points $x_i$ and $x_j$.
- Given $O(P^2)$ distances, infer positions.
50Distance vs Observed
51Matrix-to-Line
- Given a $P \times P$ positive symmetric real-valued matrix $D$ of measured distances.
- The entry $d_{i,j} \sim f(d \mid x)$.
- Choose an embedding of the points
- $\{x_1, x_2, \ldots, x_P\} \subset [0, G]$,
- which maximizes a likelihood function
- $\prod_{1 \le i, j \le P} f(|x_i - x_j| \mid d_{i,j})$
52Bayes Formula
53Minimizing a Quadratic Cost Function
54A Physical Model
[Figure: points P1, P2, P3, P4 joined pairwise by springs labeled d1,2, d1,3, d1,4, d2,3, d2,4, d3,4]
Mass-less balls connected with springs of different stiffness
55Algorithm
Join
- Consider measured distances of length less than $\theta L$. Examine these distances in increasing order.
- $\theta \in (0,1)$ to be determined by the Chernoff bounds
- Initially, every probe is a singleton contig.
- Two operations, Join and Adjust, either combine smaller contigs or improve an existing contig.
56Algorithm
Adjust
- Join and Adjust locally minimize the log-likelihood cost function
- Local minimum of a weighted sum-of-squares error function
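The weighted sum-of-squares view has a simple mechanical reading: relax a chain of springs until forces balance. The sketch below is a simplified stand-in for an adjust-style step, assuming the left-to-right order of the probes in a contig is already known and pinning probe 0 at position 0. The function name, the `(i, j, d, w)` spring format, and the Gauss-Seidel update are all illustrative choices rather than the talk's algorithm.

```python
def relax_positions(n, springs, sweeps=200):
    """Minimize sum of w * (x[j] - x[i] - d)^2 over positions x, x[0] = 0.

    `springs` is a list of (i, j, d, w): probe j is assumed to lie a
    measured distance d to the right of probe i, with stiffness w.
    """
    neighbors = [[] for _ in range(n)]
    for i, j, d, w in springs:
        neighbors[i].append((j, -d, w))   # x[i] should equal x[j] - d
        neighbors[j].append((i, +d, w))   # x[j] should equal x[i] + d

    x = [0.0] * n
    for _ in range(sweeps):
        for k in range(1, n):             # probe 0 stays pinned
            if not neighbors[k]:
                continue
            # Exact minimizer for x[k] with the other probes held fixed:
            # the stiffness-weighted average of where each spring wants it.
            num = sum(w * (x[o] + off) for o, off, w in neighbors[k])
            den = sum(w for _, _, w in neighbors[k])
            x[k] = num / den
    return x


true_pos = [0.0, 2.0, 5.0, 9.0]
springs = [(i, j, true_pos[j] - true_pos[i], 1.0)
           for i in range(4) for j in range(i + 1, 4)]
print(relax_positions(4, springs))
```

Giving short, reliable distances a larger stiffness `w` lets them dominate the layout, while long, noisy measurements only nudge it.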
57Algorithmic Complexity
58The Experiments
- Outcomes for probe $p_i$
- $p_i$ hybridizes to zero BACs: Outcome B (blank)
- $p_i$ hybridizes to at least one red BAC and zero green BACs: Outcome R (red)
- $p_i$ hybridizes to zero red BACs and at least one green BAC: Outcome G (green)
- $p_i$ hybridizes to at least one red BAC and at least one green BAC: Outcome Y (yellow)
- We call these events $i_B$, $i_R$, $i_G$ and $i_Y$ respectively.
59Hamming Distance
- The full experiment consists of M random samples.
- The output is a color string for each probe.
- $s_j = \langle s_{j,k} \rangle_{k=1}^{M}$ with $s_{j,k} \in \{B, R, G, Y\}$
- associated with probe $p_j$
- $H_{i,j}$ = number of places where $s_i$ and $s_j$ differ
- $C_{i,j}$ = number of places where $s_i$ and $s_j$ are the same but not blank
- $T_{i,j}$ = number of places where $s_i$ and $s_j$ are both blank
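The three counts can be computed in one pass over a pair of color strings. This is a minimal sketch; the function name is hypothetical and the strings are assumed to use the letters B, R, G, Y with one character per experiment.

```python
def compare_color_strings(s_i, s_j):
    """Return (differ, same_not_blank, both_blank) for two color strings."""
    if len(s_i) != len(s_j):
        raise ValueError("color strings must cover the same experiments")
    differ = same = blank = 0
    for a, b in zip(s_i, s_j):
        if a != b:
            differ += 1        # Hamming-type disagreement
        elif a == "B":
            blank += 1         # neither probe hit any clone
        else:
            same += 1          # same informative color
    return differ, same, blank


print(compare_color_strings("BRGYB", "BRYYG"))
```

The three counts always add up to the number of experiments M, so any two of them determine the third.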
60Notations
- $N_f$ = Clones per experiment
- $M$ = Experiments
- $L$ = Length of a clone
- $G$ = Length of a genome
- $a = N_f/G$ = Pr[a clone starts at a site]
- $c = N_f L/G = aL$ = coverage per experiment
- $\alpha = a_G = a_R = a/2 = c/(2L)$
- Probability of Events
- $C = (i_G \wedge j_G) \vee (i_R \wedge j_R) \vee (i_Y \wedge j_Y)$
- $T = (i_B \wedge j_B)$
- $H = \neg (C \vee T)$
62Computing the Probabilities
63Computing the Probabilities
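- $\Pr(C \mid x \le L) = [1 - 2e^{-\alpha L} + 2e^{-\alpha (L+x)}]^2 - e^{-2\alpha (L+x)}$
- $\Pr(T \mid x \le L) = e^{-2\alpha (L+x)}$
- $\Pr(H \mid x \le L) = 1 - [1 - 2e^{-\alpha L} + 2e^{-\alpha (L+x)}]^2$
- (Here $\alpha$ denotes the per-color clone start rate.)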
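A quick numerical check of these expressions is sketched below, reading the rate in the exponents as a per-color clone start rate (called `alpha` here, an assumption about the garbled notation). The bracketed term is the chance that the two probes agree in one color channel; squaring it covers the two independent channels. The function name is hypothetical.

```python
import math


def outcome_probabilities(x, alpha, L):
    """Return (Pr(C), Pr(T), Pr(H)) for two probes at distance x <= L."""
    if not 0.0 <= x <= L:
        raise ValueError("formulas are stated for 0 <= x <= L")
    # Probability that both probes have the same hit/no-hit status
    # in a single color channel.
    same_one_color = 1.0 - 2.0 * math.exp(-alpha * L) \
        + 2.0 * math.exp(-alpha * (L + x))
    p_t = math.exp(-2.0 * alpha * (L + x))   # both probes blank
    p_c = same_one_color ** 2 - p_t          # same color, not blank
    p_h = 1.0 - same_one_color ** 2          # colors differ
    return p_c, p_t, p_h


for x in (0.0, 0.25, 0.5, 1.0):
    print(x, outcome_probabilities(x, alpha=1.0, L=1.0))
```

The disagreement probability is zero at x = 0 and grows with x, which is what makes the observed Hamming fraction usable as a distance estimator.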
64Computing the Probabilities
65Final Estimator
66Chernoff Bound
- False Positives: $(d < \theta L) \wedge (x > L)$
- False Negatives: $(x < \theta L) \wedge (d > L)$
67Computing the Chernoff Bounds
68Yeast Mapping
69Steps in Mapping
70Data from One Experiment
71Expectation Maximization
72Map
73Local Distances
74Sequence Validation
75Sequence Validation
76Sequence Validation
77Sir Ernest Rutherford
- I have become more and more impressed by the power of the scientific method of extending our knowledge of nature.
- Experiment, directed by the imagination of either an individual, or still better of a group of individuals of varied mental outlook, is able to achieve results which far transcend the imagination alone of the greatest natural philosopher.
78Sir Ernest Rutherford
- Experiment without imagination, or imagination
without recourse to experiment, can accomplish
little. But for effective progress, a happy blend
of these powers is necessary
79- Students
- Fang Chen
- Jiawu Feng
- Ofer Gill
- Matthias Heymann
- Iuliana Ionita
- Venkatesh P. Mysore
- Marina Spivak
- Bing Sun
- Yi (Joey) Zhou
- Visitors
- Marco Isopi
- Carla Piazza
- Alberto Policriti
- Naomi Silver
- Chris Wiggins
- Franz Winkler
- Principal Investigator
- Bud Mishra
- Researchers
- Marco Antoniotti
- Paolo Barbano
- Vera Cherepinsky
- Raoul-Sam Daruwala
- Gilad Lerman
- Joe McQuown
- Toto Paxia
- Archisman Rudra
- Nadia Ugel
- Alumni
- Will Casey
- Marc Rejali
80The End
- http://www.cs.nyu.edu/mishra
- http://bioinformatics.cat.nyu.edu
- Valis, Gene Grammar, NYU MAD, Cell Simulation,
81Other Ongoing Projects
- SINGLE MOLECULE MAPPING
- Single Molecule Genomics: Optical Mapping, Optical Sequencing & RFLP Haplotyping
- (In collaboration with Wisc; funded by NCI)
- ARRAY MAPPING
- (In collaboration with MIT; funded by NSF ITR)
- ARRAY CGH
- Microarray-based Genome Mapping
- (In collaboration with NYU Med School & CSHL; funded by NCI/NIH)
- EXPRESSION DATA ANALYSIS
- (In collaboration with NYU Biology & Med School; funded by NSF & MHHI)
82SINGLE MOLECULE OPTICAL MAPPING
83Error Sources
- Sizing Error
- (Bernoulli labeling, absorption cross-section, PSF)
- Partial Digestion
- False Optical Sites
- Orientation
- Spurious molecules, Optical chimerism, Calibration
Image of restriction enzyme digested YAC clone: YAC clone 6H3, derived from human chromosome 11, digested with the restriction endonucleases Eag I and Mlu I, stained with a fluorochrome and imaged by fluorescence microscopy.
84Optical Mapping: Interplay between Biology and Computation
85Y
- From a gene's point of view, reshuffling is a great restorative
- The Y, in its solitary state, disapproves of such laxity. Apart from small parts near each tip which line up with a shared section of the X, it stands aloof from the great DNA swap. Its genes, such as they are, remain in purdah as the generations succeed. As a result, each Y is a genetic republic, insulated from the outside world. Like most closed societies it becomes both selfish and wasteful. Every lineage evolves an identity of its own which, quite often, collapses under the weight of its own inborn weaknesses.
- Celibacy has ruined man's chromosome.
- Steve Jones, Y: The Descent of Men, 2002.
86Mapping the DAZ locus on Y Chromosome
87GCP is NP-Complete
- Transformation from Hamiltonian Path Problem
restricted to cubic graphs.
Choose p 3/4 k M
88NP-Completeness
- $G$ has a Hamiltonian path
- $v_1, v_2, \ldots, v_M$
- Then, the admissible placement is
- $D_1, D_2, \ldots, D_M$
- with at most two intervals $I_j$, $I_{j+1}$
- overlapping with $k$ cuts in common.
- Conversely, any admissible placement with a goodness $> k$ induces a permutation $\pi$ on the indices of the vertices of $G$.
- $v_{\pi(1)}, v_{\pi(2)}, \ldots, v_{\pi(M)}$: Hamiltonian
[Figure labels: v1, v2, v3; D1, D2, D3; Consensus Map]
89Experiment Design
- Relation among the error parameters
- $3\beta n p/4 \le k \le n p^4 \theta/2$
- $\Rightarrow p \ge (3\beta/(2\theta))^{1/3}$
- Parameter choice for shotgun mapping: make the partial digestion probability rather high (close to 1) or keep the relative sizing error low, for instance by using a rare cutter.
90Contour Plot as a Function of Sizing Error
(x-axis) and Digestion Rate (y-axis)
- The calculation is for the human genome, G = 3,300 Mb.
- The average molecule length is 5 Mb, with an overlap of 1 Mb
- The average restriction fragment length is 25 kb
- For a sizing error of 3 kb, the required digestion rate is 94%
- If the sizing error is reduced to 2 kb, the required digestion rate drops to 88%
- If the sizing error is reduced to 1 kb, the required digestion rate drops to 80%
- (See Mathematica Demo)
91Gentig's Successes
- E. coli
- P. falciparum, D. radiodurans, Y. pestis
- Rhodobacter sphaeroides, Shigella flexneri, Salmonella enterica
- Aspergillus fumigatus
- The automated Gentig system is routinely used to map microbe genomes quickly and effortlessly by scientists with no quantitative or computational training.
Shotgun Optical Mapping of Genomes
92VALIS: Vast Active Living Intelligence System
94Key Features of Valis
- State of the art of rapid prototyping in bioinformatics, functional genomics and systems biology.
- Multilanguage Scripting
- Data storage
- Graphical User Interfaces
95Visual Genome
96Multi-Scripting
- A Valis script can be written in any supported language
- JScript, VBScript, Python, PERL, Lisp, R and SETL.
- All the scripts see the same Valis class hierarchy.
- For example, once a user learns that a Valis Sequence Object has a method called Input that will read the sequence from a file, the user can subsequently use this same primitive from all the different languages.
97Advantages
- We can take the best from each language
- Graph algorithms in SETL
- Sockets in Python
- Regular Expressions in Perl
- AI in Lisp
- Statistics in R
- ..
98Data Storage
- Based on Extended B-Trees
- At the lowest level there is a heap of pages
- Must correctly keep track of the reference counts
of each record/object to implement value semantics
99B Tree Indexes
- Leaf pages contain data entries, and are chained (prev & next)
- Non-leaf pages contain index entries and direct searches
[Figure labels: Non-leaf Pages; Leaf Pages]
100Hardware
- Although Valis is designed to be used on workstations, we can run the computation-intensive processes on Beowulf computing servers. This cluster has
- 16 compute nodes connected via Gigabit Ethernet and a low-latency, high-speed network from Dolphinics.
- Cluster nodes are dual 2.4GHz Intel Xeon processors with 4GB of memory.
- It is arranged in a 3D torus topology allowing each node to communicate directly with its three nearest neighbors.
- The disk storage capacity for hosting our databases is about one Terabyte.
- The operating system is Linux and we use high-performance MPI libraries from Scali.
101Visualization
- Once the processing is completed, it is very important to be able to quickly visualize the results.
- For this reason Valis provides numerous visualization tools that allow a user to quickly display
- sequences, maps,
- microarray data,
- tables,
- graphs and annotations.
- These widgets can be customized from the scripts.
102Valis Demo
- Bioinformatics 1.1, 1.2, 1.3
- Systems Biology Simple Pathway 2.1, 2.2, 2.3
- Systems Biology Apoptosis 2.4
- Simpathica
- XS-System
- BioWave
- BioSim
- NYUMAD
103SIMPATHICA: Systems Biology
How much of reasoning about biology can be
automated?
104Why do we need a tool?
We claim that, by drawing upon mathematical
approaches developed in the context of dynamical
systems, kinetic analysis, computational theory
and logic, it is possible to create powerful
simulation, analysis and reasoning tools for
working biologists to be used in deciphering
existing data, devising new experiments and
ultimately, understanding functional properties
of genomes, proteomes, cells, organs and
organisms.
Simulate Biologists! Not Biology!!
105Reasoning and Experimentation
106Simpathica is a modular system
Canonical Form
- Characteristics
- Predefined Modular Structure
- Automated Translation from Graphical to Mathematical Model
- Scalability
107Glycolysis
Glycogen
P_i
Glucose-1-P
Glucose
Phosphorylase a
Phosphoglucomutase
Glucokinase
Glucose-6-P
Phosphoglucose isomerase
Fructose-6-P
Phosphofructokinase
108Reaction Scheme for Wnt Signaling.
The reaction steps of the Wnt pathway are
numbered 1 to 19. Protein complexes are denoted
by the names of their components. Phosphorylated
components are marked by an asterisk.
Single-headed solid arrows characterize
irreversible reactions. Double-headed arrows
denote binding equilibria. Blue arrows mark
reactions that have only been taken into account
when studying the effect of high axin
concentrations.
109Broken arrows represent activation of Dsh by the Wnt ligand (step 1), Dsh-mediated initiation of the release of GSK3β from the destruction complex (step 3), and APC-mediated degradation of axin (step 15). The broken arrows indicate that the components mediate but do not participate stoichiometrically in the reaction scheme. The irreversible reactions 2, 4, 5, 9-11, and 13 are unimolecular, and reactions 6, 7, 8, 16, and 17 are reversible binding steps.
110Steady State Concentration
111β-catenin degradation
112Wnt Demo
- Systems Biology Wnt Pathway 3.1, 3.2
- SimpathicaA
113The Cell Cycle
G1
start
cell division
Cdk
Cdk
Cdk
Cyclin
S
M (anaphase)
APC
APC
finish
G2
M (metaphase)
114Cyclin B/Cdk and Cdh1/APC
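- $\frac{d[\mathrm{CycB}]}{dt} = k_1 - (k_2' + k_2''[\mathrm{Cdh1}])\,[\mathrm{CycB}]$
- $\frac{d[\mathrm{Cdh1}]}{dt} = \frac{(k_3' + k_3'' A)(1 - [\mathrm{Cdh1}])}{J_3 + 1 - [\mathrm{Cdh1}]} - \frac{k_4\, m\, [\mathrm{CycB}]\,[\mathrm{Cdh1}]}{J_4 + [\mathrm{Cdh1}]}$
- A pair of nonlinear ODEs (ordinary differential equations) describing the biochemical reactions at the center.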
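A two-variable system of this shape can be integrated numerically in a few lines. The sketch below assumes the standard two-variable Cyclin B/Cdh1 form (CycB synthesized at a constant rate and degraded in a Cdh1-dependent way; Cdh1 switched on and off through saturating terms). The rate constants are illustrative placeholders and are not taken from the slides; `m` is the cell mass parameter and `A` the Cdh1 activator.

```python
def derivatives(cycb, cdh1, m, A=0.0,
                k1=0.04, k2p=0.04, k2pp=1.0,
                k3p=1.0, k3pp=10.0, k4=35.0, J3=0.04, J4=0.04):
    """Right-hand sides of the CycB/Cdh1 pair (parameter values assumed)."""
    dcycb = k1 - (k2p + k2pp * cdh1) * cycb
    dcdh1 = ((k3p + k3pp * A) * (1.0 - cdh1) / (J3 + 1.0 - cdh1)
             - k4 * m * cycb * cdh1 / (J4 + cdh1))
    return dcycb, dcdh1


def simulate(cycb, cdh1, m, t_end=100.0, dt=0.01):
    """Integrate with classical fourth-order Runge-Kutta."""
    for _ in range(int(round(t_end / dt))):
        a1, b1 = derivatives(cycb, cdh1, m)
        a2, b2 = derivatives(cycb + 0.5 * dt * a1, cdh1 + 0.5 * dt * b1, m)
        a3, b3 = derivatives(cycb + 0.5 * dt * a2, cdh1 + 0.5 * dt * b2, m)
        a4, b4 = derivatives(cycb + dt * a3, cdh1 + dt * b3, m)
        cycb += dt * (a1 + 2.0 * a2 + 2.0 * a3 + a4) / 6.0
        cdh1 += dt * (b1 + 2.0 * b2 + 2.0 * b3 + b4) / 6.0
    return cycb, cdh1


# Small cell (m = 0.3), starting near the G1-like state.
print(simulate(0.05, 0.9, m=0.3))
```

With these placeholder values and a small mass, the trajectory settles into a state with high Cdh1 and low CycB; raising `m` is the knob that would be varied to explore the transition out of that state.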
115Simulation of Yeast Cell Cycle
116Simulation of Yeast Cell Cycle
117Simulation of Yeast Cell Cycle
118The Natural Language Interface
119Story generation
- Temporal Logic formulae can be rendered in English.
- Temporal Logic formulae can be generated automatically (with care).
- Each formula can be tested against a set of datasets; differences can then be noted.
120Cell Cycle Story Generation Results (HTML
rendering)
Report on "Test Experiment Tyson WT, 1 Mutant, 2 Mutants."
RESULTS
The results refer to the following datasets: The first dataset is named "Ian's Experiment/Tyson Yeast Dataset WT". The second dataset is named "Ian's Experiment/Tyson Yeast Dataset Mut1". The third dataset is named "Ian's Experiment/Tyson Yeast Dataset mut2".
- CDH1 less than or equal to 1.0071783 will always hold until CDH1 activates CYCB, is true in the first dataset, is true in the second dataset, and is false in the third dataset.
- CDH1 represses CYCB implies CYCB is greater than or equal to 0.65, is false in the first dataset, is true in the second dataset, and is true in the third dataset.
- eventually, CDH1 is less than or equal to CYCB, is false in the first dataset, is true in the second dataset, and is true in the third dataset.
121Genomics: Large Segmental Duplications
122Recent Segmental Duplications
Human
- 3.5-5% of the human genome is found to contain segmental duplications, with length > 5 kb or 1 kb, identity > 90%.
- August, 2001 assembly,
- Bailey, et al. 2002.
- April, 2003 assembly,
- Cheung, et al. 2003.
- These duplications are estimated to have emerged about 40 Mya under neutral assumption.
- The duplications are mostly interspersed (non-tandem), and happen both inter- and intra-chromosomally.
From Bailey, et al. 2002
123Recent Segmental Duplications
Mouse
- 1.2% of the mouse genome is found to contain segmental duplications, with length > 5 kb, identity > 90%.
- February, 2003 mouse assembly,
- Cheung, et al. 2003.
- These duplications are estimated to have emerged about 25 Mya under neutral assumption.
- The duplications happen both inter- and intra-chromosomally.
From Cheung, et al. 2003
124Statistical Analysis: Duplication Flanking Sequences
- What are the molecular mechanisms that caused the recent segmental duplications in the human and mouse genomes?
- Thermodynamic instability in the DNA sequences
- Recombination between homologous repeat elements
- Other unknown mechanisms.
125Thermodynamics
126Hypotheses
Scenario 1: Repeat-Mediated Homologous Recombination
Scenario 2: Preferential Repeat Insertion after Duplication
Scenario 3: Artificial Boundary Effect from Duplication Mapping
[Figure labels: Duplicated segment (one per scenario)]
Overrepresentation of repeats in the flanking regions
127The Model
128The Mathematical Model
- h1: proportion of duplications by repeat recombination
- h1: proportion of duplications by recombination of the specific repeat
- h1- -: proportion of duplications by recombination of other repeats
- h0: proportion of duplications by other repeat-unrelated mechanism
- h0: proportion of h0 with common specific repeat in the flanking regions
- h0-: proportion of h0 with no common specific repeat in the flanking regions
- h0- -: proportion of h0 with no specific repeat in the flanking regions
- a: mutation rate in duplicated sequences
- β: insertion rate of the specific repeat
- ?: mutation rate in the specific repeat
- d: divergence level of duplications
- e: divergence interval of duplications.
129Model Validation
[Figure: panels for Alu and L1; curves labeled f; axis: Diversity]
- The model parameters (aAlu, βAlu, ?Alu, aL1, βL1, ?L1) are estimated from the reported mutation and insertion rates in the literature.
- The relative strengths of the alternative hypotheses can be estimated by model fitting to the real data.
- h1Alu 0.76 h1Alu 0.3 h1L1 0.76 h1 L1 0.35.
130ChIP-Chip Analysis
131Details (math)
- Idea in a nutshell (assume symmetric data)
- Throw out genes which deviate significantly at various scales
- Stop at exhausted scales (cubes)
- Threshold in stopping cubes by estimating Cs(yx)
- Average over shifted grids
132Procedure of Algorithm
- Recall: Fixed dyadic grid along L
- Compute FQ (also fQ, βQ, sQ) in top-down alg
- Stop at an interval if either
133Normalization.
- Recall setting (simplified): Data = N × 2 matrix of log. EVs
- Problem: systematic variation. Different EVs are recorded for the same amount of mRNA.
- Normalization: Removal of variation to allow balanced comparisons
134Normalization (continued)
- Related stat. terminology: conditional mean estimation and previous terminology
- Related math problem: Construct a graph (or chord-arc curve) and a strip around it
- Approach to solve math problem: Combine ideas of multiscale curve/graph constructions (Jones, David and Semmes, L) with the multistrip construction before
135ChIP-Chip Experiments I
136ChIP-Chip Experiments II
137ChIP-Chip Experiments III
138CARTwheels: Redescription
139What is redescription?
- Shift of vocabulary from one language (descriptor family) to another to describe the same entity
- A descriptor is any meaningful way of defining a subset within a universal set of entities
- Set-theoretic operations are used on basic descriptors to define derived descriptors
- Evaluated on the basis of Jaccard's coefficient: $J(A \Leftrightarrow B) = |A \cap B| / |A \cup B|$
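Jaccard's coefficient is the size of the intersection of two descriptor sets divided by the size of their union. A minimal sketch follows, with hypothetical gene identifiers and an assumed convention that two empty descriptors score 1.0.

```python
def jaccard(a, b):
    """Jaccard's coefficient of two descriptors given as collections."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0   # assumed convention for two empty descriptors
    return len(a & b) / len(union)


# Hypothetical descriptors: genes induced under a stress, and genes in a
# functional category.
stress_up = {"g1", "g2", "g3"}
category = {"g2", "g3", "g4"}
print(jaccard(stress_up, category))
```

A coefficient of 1 means the two descriptors pick out exactly the same genes, so one is an exact redescription of the other; values below 1 correspond to inexact redescriptions.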
140Why Redescribe?
- Allows feature construction
- Can handle any kind of data in terms of descriptors; no data-specific mining required
- Can find commonalities and differences between various descriptors/descriptor families at the same time
- Can look for stories using a series of inexact redescriptions
141CARTwheels algorithm for redescription
142CARTwheels algorithm for redescription
143Implementation details
- Simplified version of CARTwheels algorithm used to speed up the process
- Algorithm implemented in C on UNIX
- Visualization implemented in Java on UNIX
- Interacts with Postgres-based database to extract data/descriptors
144Implementation details descriptors used
- Experimental (microarray) data for yeast from Gasch et al. Descriptors constructed of the form >, <
- 9 different stresses used from Gasch et al. data
- GO category assignments for genes (biological process, cellular component, molecular function)
145Design of System