Title: Algorithmic Analysis of Human DNA Replication Timing from Discrete Microarray Data
1Algorithmic Analysis of Human DNA Replication
Timing from Discrete Microarray Data
- Christopher Taylor
- Gabriel Robins, Anindya Dutta
2Thesis Statement
- The DNA replication timing profile can be
reconstructed efficiently and accurately from
discrete time points.
3Presentation Outline
- Biology background
- Microarray technology
- Experimental data
- Challenges
- Algorithms
- Research Plans
- Replication timing
- Origins
- Scale up
4Why Study DNA Replication?
- Natural Science
- DNA is the blueprint for organisms
- It must be passed on (organism, cell)
- Engineering
- Gene therapy
- Insertion, deletion, modification
- Cancer is unchecked replication
5... A G G T C G A C A C ...
... T C C A G C T G T G ...
- Human genome > 3 billion bp
- Replication rate: 1000 bp/min
- Serial replication → 5.7 years
- 6 to 10 hours (speedup > 5000)
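The figures on this slide can be checked with a few lines of arithmetic. This is only a sketch: it assumes the quoted rate applies to a single point of replication working through the genome serially, and it uses the lower bound of 3 billion bp. The constant names are illustrative.

```python
# Back-of-the-envelope check of the slide's numbers.
# Assumption: one replication point copies the whole genome serially.
GENOME_BP = 3_000_000_000          # lower bound from the slide (> 3 billion bp)
RATE_BP_PER_MIN = 1000             # replication rate from the slide
MINUTES_PER_YEAR = 60 * 24 * 365

serial_minutes = GENOME_BP / RATE_BP_PER_MIN
serial_years = serial_minutes / MINUTES_PER_YEAR

# Observed duration is 6 to 10 hours, so the speedup is bounded by these two.
speedup_at_10h = serial_minutes / (10 * 60)
speedup_at_6h = serial_minutes / (6 * 60)

print(f"serial replication: {serial_years:.1f} years")
print(f"speedup: {speedup_at_10h:.0f}x (10 h) to {speedup_at_6h:.0f}x (6 h)")
```

Because the genome is larger than 3 billion bp, the 10-hour figure is a lower bound on the speedup, which matches the slide's "> 5000".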
6Background
- Prokaryotes
- E. coli
- DnaA binds to oriC
- Eukaryotes: ORC
- S. cerevisiae (yeast)
- ARS: 11 bp consensus
- Mapping of origins
- Human
- No known consensus
- Few origins characterized
7Genome Tiling Microarrays
- Interrogation at genomic scale
- Large increase in data
- Microarray data analysis
- Array of probes tiles genome
(Figure: probes tiled along a stretch of genomic sequence)
- Cross-hybridization
- Repeats not tiled
- Gaps in genome
(Figure: PM probe GAGTACATAGCATACCATGACTAGA; the MM probe substitutes an A at one position)
8Image analysis computes intensity of each array probe
9The Cell Cycle
Start of S-phase (0 hour)
S-Phase
10Profiling DNA Replication Timing
- Ideal: f(chr, bp) → rtime
- Isolate DNA replicated in discrete parts of S-phase
- One cell is not enough
- Synchronize S-phase entry
- Apply drugs
- Release together
- Synchronization error
- Label in two-hour intervals
- Allelic Variation
- mf(chr, bp) → rtime1, rtime2, ...
11Allelic Variation
- Fluorescence in situ hybridization (FISH)
- Replication timing at a given site
(Figure: FISH at 0, 2, 4, 6, 8 and 10 hr, contrasting temporally non-specific (TNS) replication with temporally specific (TS) replication)
12What is the Problem?
- Reconstruct a continuous replication profile
- Temporally (time points)
- Spatially (probes)
- from noisy data
- Biological experiments
- Synchronization error
- Microarray artifacts
- efficiently
- Genomic data (> 3 billion bp)
13Initial Analysis
- Tiling Analysis Software (TAS)
- Wilcoxon Rank Sum test in sliding window
- Assess enrichment of treatment over control
- Window slides to get p-value for each probe
- O(kn) time complexity
- n probes on array
- k probes in a window
- k scales linearly with window size
14New Analysis
- Thesis Statement (revisited)
- The DNA replication timing profile can be reconstructed efficiently and accurately from discrete time points.
- Incorporate information from all time points
- Continuous view of replication timing (TR50)
- Address temporally non-specific replication
- Scale up to the whole genome efficiently
15Allelic Variation Examples
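(Figure: Temporally specific replication. Values 0, 0, 1/1, 0, 0 along a time axis of 0, 2, 4, 6, 8, 10 hr; TR50 marked at 5.)
(Figure: Temporally non-specific replication. Values 1/6, 1/6, 1/3, 0, 1/3 along the same time axis; TR50 marked at 5.)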
Challenge: From the distribution of array signal, determine the replication category.
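The two example distributions on this slide both have TR50 at 5 even though their shapes differ. A minimal sketch of one way to compute TR50 is given here. It assumes TR50 is the time by which half of a probe's total signal has accumulated, that each value belongs to one two-hour labeling interval, and that signal is spread evenly within an interval (linear interpolation). The function name `tr50` and the interpolation rule are assumptions for illustration, not taken from the slides.

```python
from fractions import Fraction


def tr50(signal, interval_hours=2):
    """Time (hours) at which half of the cumulative signal is reached.

    signal[i] is the signal in the i-th labeling interval.
    Assumption: signal is uniform within an interval, so the crossing
    point is found by linear interpolation.
    """
    total = sum(signal)
    if total <= 0:
        raise ValueError("probe has no signal")
    half = total / 2
    cumulative = 0
    for i, s in enumerate(signal):
        if s > 0 and cumulative + s >= half:
            return interval_hours * (i + (half - cumulative) / s)
        cumulative += s
    return interval_hours * len(signal)  # only reachable through rounding


# The two distributions read off the slide.
specific = [0, 0, 1, 0, 0]
non_specific = [Fraction(1, 6), Fraction(1, 6), Fraction(1, 3), 0, Fraction(1, 3)]

print(tr50(specific), tr50(non_specific))
```

Under these assumptions both examples give 5 hours, which is why TR50 alone cannot separate the two categories and a separate specificity test is needed.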
16Temporal Specificity Algorithm
- // Is there evidence that all alleles are replicating together?
- If (max sum of two adjacent time points ≥ 5/6 total sum) then probe is temporally specific
- // Is at least one allele replicating apart from the majority?
- Else If (max sum of two adjacent time points not including the maximum time point ≥ 1/3 total sum) then probe is temporally non-specific
- // Isolated signal is not strong enough to be an allele.
- Else probe is temporally specific
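The three-way rule on this slide can be sketched as a small function. Several details are assumptions: the comparison operators did not survive in the slide text and are taken to be "at least" here; "not including the maximum time point" is read as "adjacent pairs in which neither point is the peak"; and ties for the peak are broken by taking the earliest. The name `classify_probe` is illustrative.

```python
def classify_probe(signal):
    """Return "TS" (temporally specific) or "TNS" (non-specific).

    signal[i] is the probe's signal at the i-th time point.
    Thresholds 5/6 and 1/3 come from the slide; the >= comparisons
    and the tie-breaking rule are assumptions.
    """
    total = sum(signal)
    if total <= 0:
        raise ValueError("probe has no signal")

    # Sum of every pair of adjacent time points, keyed by the left index.
    pairs = [(i, signal[i] + signal[i + 1]) for i in range(len(signal) - 1)]

    # All alleles replicating together: one adjacent pair holds >= 5/6.
    if max(s for _, s in pairs) * 6 >= 5 * total:
        return "TS"

    # An allele replicating apart: a pair away from the peak holds >= 1/3.
    peak = max(range(len(signal)), key=signal.__getitem__)
    off_peak = [s for i, s in pairs if peak not in (i, i + 1)]
    if off_peak and max(off_peak) * 3 >= total:
        return "TNS"

    # Isolated signal too weak to count as an allele.
    return "TS"


print(classify_probe([0, 0, 6, 0, 0]))   # all signal in one interval
print(classify_probe([1, 1, 2, 0, 2]))   # 1/6, 1/6, 1/3, 0, 1/3 scaled by 6
print(classify_probe([1, 0, 7, 0, 2]))   # weak isolated signal
```

The comparisons are written as integer multiplications rather than divisions so that exact inputs (integers or fractions) are not affected by floating-point rounding at the thresholds.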
17Plotting TR50
(Figure: TR50 in hours, 2 to 8, plotted against chromosomal position, 33 to 34 million bp)
- Smoothed TR50 curve recovers replication pattern
- Local minima → possible locations of replication origins
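The idea of smoothing TR50 and then reading off local minima can be sketched as follows. The smoother here is a plain centered moving average, used only as a simple stand-in; it is not the smoother used in this work. The function names, window size, and sample values are all illustrative assumptions.

```python
def moving_average(values, half_width):
    """Centered moving average; windows are truncated at both ends."""
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half_width)
        hi = min(len(values), i + half_width + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


def local_minima(values):
    """Indices whose value is strictly below both neighbours.

    Flat plateaus are not reported; end points are never reported.
    """
    return [
        i
        for i in range(1, len(values) - 1)
        if values[i] < values[i - 1] and values[i] < values[i + 1]
    ]


# Made-up per-probe TR50 values (hours) with one real dip and some noise.
raw = [6, 5, 4, 4.5, 3, 2, 3, 2.8, 4, 5, 6]
smoothed = moving_average(raw, half_width=1)

print(local_minima(raw))       # noisy: several candidate minima
print(local_minima(smoothed))  # after smoothing: a single candidate
```

On the raw values, noise produces three minima; after smoothing only one remains, at the cost of shifting its position by one probe. That trade-off is why the smoothing bandwidth needs to be tuned.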
18Segregation Algorithm
- Sliding window passes over probes to generate intervals
- Ratio of temporally specific probes (TSP) to temporally non-specific probes (TNSP) determines temporal specificity
- Average TR50 determines timing category
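A minimal sketch of the segregation idea follows: label each window, then merge runs of identical labels into intervals. The slides do not give the thresholds, so the window size, the minimum TSP-to-TNSP ratio, and the Early/Mid/Late cut-offs below are hypothetical placeholders. Averaging TR50 over every probe in the window is also an assumption. For clarity each window is recomputed from scratch.

```python
from itertools import groupby


def window_label(chunk, min_ratio, early_before, late_after):
    """Label one window of (is_temporally_specific, tr50_hours) probes."""
    ts = sum(1 for is_ts, _ in chunk if is_ts)
    tns = len(chunk) - ts
    if ts < min_ratio * tns:          # too few specific probes
        return "TNS"
    mean_tr50 = sum(t for _, t in chunk) / len(chunk)
    if mean_tr50 < early_before:
        return "Early"
    if mean_tr50 > late_after:
        return "Late"
    return "Mid"


def segregate(probes, window=3, min_ratio=1.0, early_before=4.0, late_after=6.0):
    """Return (first_window, last_window, label) runs over sliding windows.

    All threshold defaults are hypothetical.
    """
    labels = [
        window_label(probes[i:i + window], min_ratio, early_before, late_after)
        for i in range(len(probes) - window + 1)
    ]
    intervals, start = [], 0
    for label, run in groupby(labels):
        n = len(list(run))
        intervals.append((start, start + n - 1, label))
        start += n
    return intervals


# Made-up probes in genomic order: (temporally specific?, TR50 in hours).
probes = [(True, 2.0), (True, 2.5), (True, 3.0),
          (False, 5.0), (False, 5.0), (False, 5.0),
          (True, 8.0), (True, 8.5), (True, 9.0)]
print(segregate(probes))
```

Interval boundaries are reported as window indices; mapping them back to genomic coordinates would use the probe positions, which are omitted here.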
19Research Plan: Profile Generation
- Parameters to evaluate
- Segregation Algorithm: sliding window size, minimum probe density
- Join Intervals: minimum interval size
20Evaluation
- Concordance of biological phenomena
- Segregation intervals ↔ FISH
- STR50 local minima ↔ Other origin methods
- Correlation with other biological data
- Gene density ↔ Early replication
- AT content ↔ Late replication
- Gene expression ↔ Early replication
- Activating acetylation/methylation ↔ Early replication
- Performance on random data
- Large quantity of TNS replication
21Research Plan: Replication Origins
- Drive DNA replication pattern
- Smoothed TR50 local minima
- Cleaned up with new profiles
- Other biological assays
- Early labeling fragments
- Nascent strands
- Bubble trapping
- ORC binding
22Approach and Evaluation
- Correlation between methods
- Consensus sets
- Motif analysis
- Positional attributes
- Replication timing
- Proximity to genes
- Evaluation is difficult (few validated origins)
- Agreement between methods
- Testing proposed correlations
- Paper in preparation
23Scaling Up to Whole Genome
- Pilot: 1% → 100% of human genome
- Algorithms developed with scalability in mind
- Incremental update sliding windows → Linear time
- Performance based evaluation
- If 100% data available
- Profile multiple runs
- Else
- Profile many 1% runs
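The incremental-update idea can be illustrated with a sliding-window mean. Instead of re-summing k values at every position, the window keeps a running sum and adjusts it by the one value that enters and the one that leaves. The total work is then linear in the number of probes, independent of window size. This is a generic sketch; the function name is illustrative and the statistic actually maintained in this work may differ.

```python
def sliding_means(values, window):
    """Mean of every length-`window` run, computed with a running sum."""
    if window <= 0 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    running = sum(values[:window])        # one-time O(window) setup
    means = [running / window]
    for i in range(window, len(values)):
        # O(1) update: add the entering value, drop the leaving one.
        running += values[i] - values[i - window]
        means.append(running / window)
    return means


def sliding_means_naive(values, window):
    """Reference version that re-sums each window: O(window * n)."""
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


data = [1, 2, 3, 4, 5]
print(sliding_means(data, 2))
```

With floating-point input the running sum can accumulate small rounding differences over very long arrays, so a periodic exact re-sum is a common safeguard.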
24Implementation Details
- Java
- Class representation of proprietary microarray files
- Algorithms to process raw microarray data
- Diagnostic tools
- Perl
- Scripts to process intermediate and final data
- Correlations, data transformation, quality assurance
- R statistical language
- Smoothing, statistical plots, correlation studies
- Shell scripts
- Automated processing of microarray sets
25Current/Expected Contributions
- Algorithms, Software Infrastructure, Analysis
- Probe-by-probe TR50 analysis
- Temporal Specificity Algorithm
- Combinatorial analysis of allele locations
- Segregation Algorithm
- TNS, Early, Mid, Late replicating areas
- Used to design validation experiments
- Smoothed TR50 profile
- Local minima provide candidate origin set
- Linear algorithms enable scale up
- Randomness testing
26Publications
- Completed
- ENCODE Project Consortium. The ENCODE (ENCyclopedia Of DNA Elements) Project. Science. 2004 Oct 22; 306(5696):636-40.
- ENCODE Project Consortium. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature. In press, to appear in the June 14, 2007 issue.
- Karnani N., Taylor C., Malhotra A., Dutta A. Pan-S replication patterns and chromosomal domains defined by genome tiling arrays of ENCODE genomic areas. Genome Research. In press, to appear in the June 2007 issue.
- UCSC Browser Tracks: TR50, Smoothed TR50, Local Minima, Segregation
- In Progress
- Multi-million dollar NIH grant for scale up to full human genome
- Paper detailing origin methods, correlations, etc.
27Timeline
- Spring 2007 (present to June 20)
- Implement proposed replication profile generation algorithms
- Generate new profiles for existing data and evaluate against FISH
- Collect new origin sets and continue analysis for paper completion
- Summer 2007 (June 21 to September 21)
- Explore correlations of new profiles with other data sets
- Submit paper to PSB 2008 based on new method and results
- Develop random data sets to test profile generation algorithms
- Fall 2007 (September 22 to December 21)
- Evaluate performance for scale up to whole genome
- Tie up loose ends and begin writing the dissertation
- Winter 2007-2008 (December 22 to March 19)
- Finish dissertation and schedule defense before May 2008
28Acknowledgements
- Advising
- Anindya Dutta, Gabriel Robins
- Biological Experiments
- Neerja Karnani, Patrick Boyle, Larry Mesner, Jamie Teer, Hakkyun Kim
- Collaborative Analysis
- Ankit Malhotra
- Discussions of Analysis
- Stefan Bekiranov
29THE END
30Why is this work computer science?
- Fred Brooks, "The Computer Scientist as Toolsmith II"
- "Hitching our research to someone else's driving problems, and solving those problems on the owners' terms, leads us to richer computer science research."
- Not an incremental improvement
- Algorithmic techniques and analysis used to solve a problem previously addressed inadequately with a statistical approach that performed poorly
- Collaboration outside of engineering disciplines enhances visibility, funding opportunities, and demand for CS work
- Developed algorithms, time complexity analysis, combinatorial analysis, feedback to experimental design
31Will this work lead to any CS publications?
- The Nature article focused on analysis of the biological data and includes descriptions of some of my algorithms
- The Genome Research paper and origins paper will also contain write-ups of my algorithms and analysis techniques
- The Pacific Symposium on Biocomputing focuses on algorithms and computational techniques
32Isn't your approach too simple?
- The approach isn't simple
- Combinatorial analysis
- Temporal specificity algorithm (many iterations)
- Probewise computation to deal with binding affinity
- Incremental updating sliding windows
- Cross-hybridization
- Synchronization error
- Smoothing
- Parameterization
- Linear algorithms for scale up
33Can't your algorithm be replaced by a well-known
statistical method?
- HMMs were used for segregation of intervals
- Performed poorly in comparison to my algorithm
- Less accurate categorization of replication intervals
- Prone to rapid oscillation, producing tiny intervals
- Parameterization was difficult
- Lowess smoothing is a statistical method
- Parameterization was not easy
34What are the biggest challenges in this work?
- Noise!
- The data to analyze comes from biological experiments with several sources of noise that compound one another
- Biology
- I haven't had a course in biology since 10th grade
- Microarrays
- New, evolving technology we're still learning to deal with
- Data size
- Hundreds of GB of data to process
- Replicates, failed experiments
- Algorithms must be efficient
35What kind of career are you aiming for after
graduation, and why?
- Teaching Computer Science (Small College)
- I enjoyed learning in my undergraduate curriculum with meaningful interactions with professors
- I taught Discrete Math at UVa in Fall '02 and Spring '03
- Enjoyable, but 60-70 students is too large
- Post-doctoral (Biological Computing)
- Many opportunities around the world
- Further exploration of the field
36How will you know when your work/thesis is done?
- Research is never really done, but you have to declare victory at some point
- The replication profiling algorithms I've developed already perform quite well
- I have concrete plans to improve and finalize them