Algorithmic Analysis of Human DNA Replication Timing from Discrete Microarray Data presentation

About This Presentation

Transcript and Presenter's Notes

Title: Algorithmic Analysis of Human DNA Replication Timing from Discrete Microarray Data

1
Algorithmic Analysis of Human DNA Replication
Timing from Discrete Microarray Data

Christopher Taylor
Gabriel Robins Anindya Dutta

2
Thesis Statement

The DNA replication timing profile can be
reconstructed efficiently and accurately from
discrete time points.

(Glossary)
3
Presentation Outline

Biology background
Microarray technology
Experimental data
Challenges
Algorithms
Research Plans
Replication timing
Origins
Scale up

4
Why Study DNA Replication?

Natural Science
DNA is the blueprint for organisms
It must be passed on (organism, cell)
Engineering
Gene therapy
Insertion, deletion, modification
Cancer is unchecked replication

5
... A G G T C G A C A C ... ... T C C
A G C T G T G ...

Human genome gt 3 billion bp
Replication rate 1000 bp/min
Serial replication ? 5.7 years
6 to 10 hours (speedup gt 5000)

6
Background

Prokaryotes
E. Coli
DnaA binds to oriC
Eukaryotes ORC
S. Cerevisiae (yeast)
ARS 11 bp consensus
Mapping of origins
Human
No known consensus
Few origins characterized

7
Genome Tiling Microarrays

Interrogation at genomic scale
Large increase in data
Microarray data analysis
Array of probes tiles genome

ATGGACTACGGATCAGTAAATCGATTAGGCACCAGATCAAGTACGATCCA
GAGTACATAGCATACCATGACTAGATACCTGATGCCTAGTCATTTAGCTA
ATCCGTGGTCTAGTTCATGCTAGGTCTCATGTATCGTATGGTACTGATCT
GAGTACATAGCATACCATGACTAGACTCATGTATCGTATGGTACTGATCT

Cross-hybridization
Repeats not tiled
Gaps in genome

PM probe
MM probe
GAGTACATAGCATACCATGACTAGA
A
8
Image analysis computes intensity of each array
probe
9
The Cell Cycle
Start of S-phase (0 hour)
S-Phase
10
Profiling DNA Replication Timing

Ideal f(chr, bp) rtime
Isolate DNA replicated in discrete parts
of S-phase
One cell is not enough
Synchronize S-phase entry
Apply drugs
Release together
Synchronization error
Label in two hour intervals
Allelic Variation
mf(chr, bp) rtime1, rtime2,

11
Allelic Variation
0hr
0hr
2hr
2hr

Fluorescent in-situ Hybridization (FISH)
Replication timing at a given site

4hr
4hr
Temporally non-specific replication (TNS)
6hr
6hr
Temporally specific replication (TS)
8hr
8hr
10hr
10hr
11
12
What is the Problem?

Reconstruct a continuous replication profile
Temporally (time points)
Spatially (probes)
from noisy data
Biological experiments
Synchronization error
Microarray artifacts
efficiently
Genomic data (gt 3 billion bp)

13
Initial Analysis

Tiling Analysis Software (TAS)
Wilcoxon Rank Sum test in sliding window
Assess enrichment of treatment over control

Window slides to get p-value for each probe
O(kn) time complexity
n probes on array
k probes in a window
k scales linearly with window size

14
New Analysis

Thesis Statement (revisited)
The DNA replication timing profile can be
reconstructed efficiently and accurately from
discrete time points.
Incorporate information from all time points
Continuous view of replication timing (TR50)
Address temporally non-specific replication
Scale up to the whole genome efficiently

15
Allelic Variation Examples
Temporally specific replication
0 0
1/1 0
0
5
0
2
4
6
8
10
TR50
Temporally non-specific replication
1/6 1/6
1/3 0 1/3
0
2
4
6
8
10
5
TR50
Challenge From distribution of array signal,
determine replication category.
16
Temporal Specificity Algorithm

// Is there evidence that all alleles are
replicating together?
If (max sum of two adjacent time points 5/6
total sum) then probe is temporally specific
// Is at least one allele replicating apart from
the majority?
Else If (max sum of two adjacent time points not
including the maximum time point 1/3 total
sum) then probe is temporally
non-specific
// Isolated signal is not strong enough to be an
allele.
Else
probe is
temporally specific

17
Plotting TR50
8 6 4 2 TR50 (hours)
33 33.5
34 Chromosomal Position
(in millions of bp)

Smoothed TR50 curve recovers replication pattern
Local minima ? Possible locations of replication
origin

18
Segregation Algorithm

Sliding window passes over probes to generate
intervals
Ratio of TSP to TNSP determines temporal
specificity
Average TR50 determines timing category

19
Research Plan Profile Generation

Parameters to evaluate
Segregation Algorithm sliding window size,
minimum probe density
Join Intervals minimum interval size

20
Evaluation

Concordance of biological phenomena
Segregation intervals ? FISH
STR50 local minima ? Other origin methods
Correlation with other biological data
Gene density ? Early replication
AT content ? Late replication
Gene expression ? Early replication
Activating acetylation/methylation ? Early
replication
Performance on random data
Large quantity of TNS replication

21
Research Plan Replication Origins

Drive DNA replication pattern
Smoothed TR50 local minima
Cleaned up with new profiles
Other biological assays
Early labeling fragments
Nascent strands
Bubble trapping
ORC binding

22
Approach and Evaluation

Correlation between methods
Consensus sets
Motif analysis
Positional attributes
Replication timing
Proximity to genes
Evaluation is difficult (few validated origins)
Agreement between methods
Testing proposed correlations
Paper in preparation

23
Scaling Up to Whole Genome

Pilot 1 100 of human genome
Algorithms developed with scalability in mind
Incremental update sliding windows ? Linear time
Performance based evaluation
If 100 data available
Profile multiple runs
Else
Profile many 1 runs

24
Implementation Details

Java
Class representation of proprietary microarray
files
Algorithms to process raw microarray data
Diagnostic tools
Perl
Scripts to process intermediate and final data
Correlations, data transformation, quality
assurance
R statistical language
Smoothing, statistical plots, correlation studies
Shell scripts
Automated processing of microarray sets

25
Current/Expected Contributions

Algorithms, Software Infrastructure, Analysis
Probe-by-probe TR50 analysis
Temporal Specificity Algorithm
Combinatorial analysis of allele locations
Segregation Algorithm
TNS, Early, Mid, Late replicating areas
Used to design validation experiments
Smoothed TR50 profile
Local minima provide candidate origin set
Linear algorithms enable scale up
Randomness testing

26
Publications

Completed
ENCODE Project Consortium. The ENCODE
(ENCyclopedia Of DNA Elements) Project. Science.
2004 Oct 22 306(5696)636-40.
ENCODE Project Consortium. Identification and
analysis of functional elements in 1 of the
human genome by the ENCODE pilot project.
Nature. In Press, to appear in June 14,
2007 issue
Karnani N., Taylor C., Malhotra A., Dutta A.
Pan-S replication patterns and chromosomal
domains defined by genome tiling arrays of encode
genomic areas. Genome Research.
In Press, to appear in June
2007 issue
UCSC Browser Tracks
TR50, Smoothed TR50, Local Minima,
Segregation
In Progress
Multi-million dollar NIH grant for scale up to
full human genome
Paper detailing origin methods, correlations, etc.

27
Timeline

Spring 2007 (present to June 20)
Implement proposed replication profile
generation algorithms
Generate new profiles for existing data and
evaluate against FISH
Collect new origin sets and continue analysis for
paper completion
Summer 2007 (June 21 to September 21)
Explore correlations of new profiles with other
data sets
Submit paper to PSB 2008 based on new method and
results
Develop random data sets to test profile
generation algorithms
Fall 2007 (September 22 to December 21)
Evaluate performance for scale up to whole
genome
Tie up loose ends and begin writing the
dissertation
Winter 2007-2008 (December 22 to March 19)
Finish dissertation and schedule defense before
May 2008

28
Acknowledgements

Advising
Anindya Dutta, Gabriel Robins
Biological Experiments
Neerja Karnani, Patrick Boyle, Larry Mesner,
Jamie Teer, Hakkyun Kim
Collaborative Analysis
Ankit Malhotra
Discussions of Analysis
Stefan Bekiranov

29
THE END
30
Why is this work computer science?

Fred Brooks The Computer Scientist as Toolsmith
II
Hitching our research to someone elses driving
problems, and solving those problems on the
owners terms, leads us to richer computer
science research.
Not an incremental improvement
Algorithmic techniques and analysis used to solve
a problem previously addressed inadequately with
a statistical approach that performed poorly
Collaboration outside of engineering disciplines
enhances visibility, funding opportunities, and
demand for CS work
Developed algorithms, time complexity analysis,
combinatorial analysis, feedback to experimental
design

31
Will this work lead to any CS publications?

The Nature article focused on analysis of the
biological data and includes descriptions of some
of my algorithms
The Genome Research paper and origins paper will
also contain writeups of my algorithms and
analysis techniques
The Pacific Symposium on Biocomputing focuses on
algorithms and computational techniques

32
Isn't your approach too simple?

The approach isnt simple
Combinatorial analysis
Temporal specificity algorithm (many iterations)
Probewise computation to deal with binding
affinity
Incremental updating sliding windows
Cross-hybridiztion
Synchronization error
Smoothing
Parameterization
Linear algorithms for scale up

33
Can't your algorithm be replaced by a well-known
statistical method?

HMMs were used for segregation of intervals
Performed poorly in comparison to my algorithm
Less accurate categorization of replication
intervals
Prone to rapid oscillation, producing tiny
intervals
Parameterization was difficult
Lowess smoothing is a statistical method
Parameterization was not easy

34
What are the biggest challenges in this work?

Noise!
The data to analyze comes from biological
experiments with several sources of noise that
compound upon one another
Biology
I havent had a course in biology since 10th
grade
Microarrays
New, evolving technology were still learning to
deal with
Data size
Hundreds of GB of data to process
Replicates, failed experiments
Algorithms must be efficient

35
What kind of career are you aiming for after
graduation, and why?

Teaching Computer Science (Small College)
I enjoyed learning in my undergraduate curriculum
with meaningful interactions with professors
I taught Discrete Math at UVa in Fall 02 and
Spring 03
Enjoyable, but 60-70 students too large
Post-doctoral (Biological Computing)
Many opportunities around the world
Further exploration of the field

Algorithmic Analysis of Human DNA Replication Timing from Discrete Microarray Data PowerPoint PPT Presentation