Algorithmic Analysis of Human DNA Replication Timing from Discrete Microarray Data


1
Algorithmic Analysis of Human DNA Replication
Timing from Discrete Microarray Data
  • Christopher Taylor
  • Gabriel Robins, Anindya Dutta

2
Thesis Statement
  • The DNA replication timing profile can be
    reconstructed efficiently and accurately from
    discrete time points.

3
Presentation Outline
  • Biology background
  • Microarray technology
  • Experimental data
  • Challenges
  • Algorithms
  • Research Plans
  • Replication timing
  • Origins
  • Scale up

4
Why Study DNA Replication?
  • Natural Science
  • DNA is the blueprint for organisms
  • It must be passed on (organism, cell)
  • Engineering
  • Gene therapy
  • Insertion, deletion, modification
  • Cancer is unchecked replication

5
... A G G T C G A C A C ...
... T C C A G C T G T G ...
  • Human genome: > 3 billion bp
  • Replication rate: 1000 bp/min
  • Serial replication → 5.7 years
  • 6 to 10 hours (speedup > 5000)
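
The figures on this slide can be checked with a few lines of arithmetic. The sketch below assumes a genome of exactly 3 billion bp (the slide says "more than"), a single replication fork, and 365-day years; the names are illustrative only.

```python
GENOME_BP = 3_000_000_000   # lower bound from the slide ("> 3 billion bp")
RATE_BP_PER_MIN = 1000      # replication rate from the slide

# Time for one fork to copy the whole genome serially.
serial_hours = GENOME_BP / RATE_BP_PER_MIN / 60
serial_years = serial_hours / 24 / 365   # assumes 365-day years


def speedup(observed_hours):
    """Ratio of serial replication time to the observed S-phase length."""
    return serial_hours / observed_hours


print(round(serial_years, 1))   # 5.7
print(speedup(10))              # 5000.0
```

A 10-hour S-phase gives a speedup of 5000 for a 3 billion bp genome, so a larger genome or a shorter S-phase pushes the figure above 5000, which matches the slide.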

6
Background
  • Prokaryotes
  • E. coli
  • DnaA binds to oriC
  • Eukaryotes: ORC
  • S. cerevisiae (yeast)
  • ARS: 11 bp consensus
  • Mapping of origins
  • Human
  • No known consensus
  • Few origins characterized

7
Genome Tiling Microarrays
  • Interrogation at genomic scale
  • Large increase in data
  • Microarray data analysis
  • Array of probes tiles genome

ATGGACTACGGATCAGTAAATCGATTAGGCACCAGATCAAGTACGATCCA
GAGTACATAGCATACCATGACTAGATACCTGATGCCTAGTCATTTAGCTA
ATCCGTGGTCTAGTTCATGCTAGGTCTCATGTATCGTATGGTACTGATCT
GAGTACATAGCATACCATGACTAGACTCATGTATCGTATGGTACTGATCT
  • Cross-hybridization
  • Repeats not tiled
  • Gaps in genome

PM (perfect match) probe: GAGTACATAGCATACCATGACTAGA
MM (mismatch) probe: same sequence with one base substituted (A in the figure)
8
Image analysis computes the intensity of each array probe
9
The Cell Cycle
Start of S-phase (0 hour)
S-Phase
10
Profiling DNA Replication Timing
  • Ideal: f(chr, bp) → rtime
  • Isolate DNA replicated in discrete parts
    of S-phase
  • One cell is not enough
  • Synchronize S-phase entry
  • Apply drugs
  • Release together
  • Synchronization error
  • Label in two-hour intervals
  • Allelic variation
  • mf(chr, bp) → rtime1, rtime2, ...

11
Allelic Variation
  • Fluorescent in-situ Hybridization (FISH)
  • Replication timing at a given site

[Figure: two series sampled at 0hr, 2hr, 4hr, 6hr, 8hr, and 10hr, one labeled temporally non-specific replication (TNS) and one labeled temporally specific replication (TS)]

12
What is the Problem?
  • Reconstruct a continuous replication profile
  • Temporally (time points)
  • Spatially (probes)
  • from noisy data
  • Biological experiments
  • Synchronization error
  • Microarray artifacts
  • efficiently
  • Genomic data (> 3 billion bp)

13
Initial Analysis
  • Tiling Analysis Software (TAS)
  • Wilcoxon Rank Sum test in sliding window
  • Assess enrichment of treatment over control
  • Window slides to get p-value for each probe
  • O(kn) time complexity
  • n probes on array
  • k probes in a window
  • k scales linearly with window size
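
As a rough illustration of the windowed test described above, the sketch below computes a one-sided rank sum p-value for each probe from the treatment and control intensities in its window. It is not the TAS implementation: the window is defined by probe index rather than genomic distance, the p-value uses a normal approximation without tie correction, and the function names and the `half_width` parameter are hypothetical.

```python
import math


def rank_sum_p(treatment, control):
    """One-sided p-value (treatment > control) from the normal
    approximation to the Wilcoxon rank sum statistic."""
    n1, n2 = len(treatment), len(control)
    pooled = sorted([(v, 0) for v in treatment] + [(v, 1) for v in control])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        # Find the run of tied values and give each its average rank.
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        for t in range(i, j + 1):
            ranks[t] = (i + j) / 2 + 1
        i = j + 1
    w = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    mean = n1 * (n1 + n2 + 1) / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean) / sd
    return 0.5 * math.erfc(z / math.sqrt(2))   # upper-tail probability


def sliding_window_p(treatment, control, half_width):
    """p-value per probe using probes within half_width indices of it."""
    n = len(treatment)
    p_values = []
    for c in range(n):
        lo, hi = max(0, c - half_width), min(n, c + half_width + 1)
        p_values.append(rank_sum_p(treatment[lo:hi], control[lo:hi]))
    return p_values
```

Every probe triggers a fresh pass over its whole window, which is where the dependence on k comes from. Because this sketch sorts each window, it costs O(k log k) per probe, slightly more than the O(kn) bound quoted on the slide.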

14
New Analysis
  • Thesis Statement (revisited)
  • The DNA replication timing profile can be
    reconstructed efficiently and accurately from
    discrete time points.
  • Incorporate information from all time points
  • Continuous view of replication timing (TR50)
  • Address temporally non-specific replication
  • Scale up to the whole genome efficiently

15
Allelic Variation Examples
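Temporally specific replication
[Figure: values per interval between time points 0, 2, 4, 6, 8, 10 hr: 0, 0, 1/1, 0, 0; TR50 = 5]
Temporally non-specific replication
[Figure: values per interval between time points 0, 2, 4, 6, 8, 10 hr: 1/6, 1/6, 1/3, 0, 1/3; TR50 = 5]
Challenge: From the distribution of array signal, determine the replication category.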
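
The slides do not define TR50 explicitly. A minimal sketch, assuming TR50 is the time by which 50% of a probe's total signal has accumulated, with linear interpolation inside the interval where the halfway mark is crossed, reproduces the value of 5 shown for both examples. The function name and the default time points are assumptions for illustration.

```python
from fractions import Fraction


def tr50(signal, times=(0, 2, 4, 6, 8, 10)):
    """signal[i] is the signal for the interval times[i]..times[i+1].
    Returns the interpolated time at which half the signal has accumulated,
    or None if there is no signal."""
    total = sum(signal)
    if total <= 0:
        return None
    half = total / 2
    cumulative = 0
    for i, s in enumerate(signal):
        if s > 0 and cumulative + s >= half:
            fraction_into_interval = (half - cumulative) / s
            return times[i] + fraction_into_interval * (times[i + 1] - times[i])
        cumulative += s
    return times[-1]   # only reachable through floating-point rounding


specific = [0, 0, 1, 0, 0]
non_specific = [Fraction(1, 6), Fraction(1, 6), Fraction(1, 3), 0, Fraction(1, 3)]
print(tr50(specific), tr50(non_specific))
```

Both distributions give the same TR50, which illustrates why TR50 alone cannot separate temporally specific from temporally non-specific replication.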
16
Temporal Specificity Algorithm
  • // Is there evidence that all alleles are
    replicating together?
  • If (max sum of two adjacent time points ≥ 5/6
    total sum) then probe is temporally specific
  • // Is at least one allele replicating apart from
    the majority?
  • Else If (max sum of two adjacent time points not
    including the maximum time point ≥ 1/3 total
    sum) then probe is temporally non-specific
  • // Isolated signal is not strong enough to be an
    allele.
  • Else probe is temporally specific
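
The rule above can be sketched in Python as follows. The comparison operators did not survive in the transcript, so `>=` is an assumption, as is the reading that "not including the maximum time point" excludes every adjacent pair containing the single largest value (the first one on ties). The function name and parameter names are hypothetical.

```python
from fractions import Fraction


def temporal_specificity(signal, together=Fraction(5, 6), apart=Fraction(1, 3)):
    """Classify one probe as "TS" or "TNS" from its per-time-point signal."""
    total = sum(signal)
    if total <= 0:
        return None
    # Sum of each pair of adjacent time points, tagged with its left index.
    pairs = [(signal[i] + signal[i + 1], i) for i in range(len(signal) - 1)]
    # Is there evidence that all alleles are replicating together?
    if max(p for p, _ in pairs) >= together * total:
        return "TS"
    # Is at least one allele replicating apart from the majority?
    peak = max(range(len(signal)), key=lambda i: signal[i])
    away = [p for p, i in pairs if peak not in (i, i + 1)]
    if away and max(away) >= apart * total:
        return "TNS"
    # Isolated signal is not strong enough to be an allele.
    return "TS"
```

With exact fractions, the input `[1/6, 1/6, 1/3, 0, 1/3]` is classified as "TNS" and `[0, 0, 1, 0, 0]` as "TS" under these assumptions.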

17
Plotting TR50
[Plot: TR50 (hours, axis from 2 to 8) against chromosomal position (33 to 34 million bp)]
  • Smoothed TR50 curve recovers replication pattern
  • Local minima → Possible locations of replication
    origin
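
A minimal sketch of the idea: smooth the per-probe TR50 values, then report the positions where the smoothed curve is lower than both neighbours. A centered moving average stands in here for the Lowess smoothing mentioned later in the talk, and probes are treated as evenly spaced; both are simplifying assumptions, and the function names are illustrative.

```python
def moving_average(values, half_width):
    """Centered moving average; windows are truncated at the ends."""
    n = len(values)
    smoothed = []
    for i in range(n):
        lo, hi = max(0, i - half_width), min(n, i + half_width + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


def local_minima(values):
    """Indices strictly lower than both neighbours (endpoints excluded)."""
    return [i for i in range(1, len(values) - 1)
            if values[i] < values[i - 1] and values[i] < values[i + 1]]


tr50_values = [6, 5, 4, 3, 4, 5, 6, 5, 4, 5, 6]
candidates = local_minima(moving_average(tr50_values, 1))
print(candidates)   # [3, 8]
```

Flat-bottomed valleys are not reported by this strict comparison, so a real pipeline would need a rule for plateaus.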

18
Segregation Algorithm
  • Sliding window passes over probes to generate
    intervals
  • Ratio of TSP to TNSP determines temporal
    specificity
  • Average TR50 determines timing category
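
The three steps above can be sketched as follows. The slides give no thresholds, so `min_ratio`, `early_cut`, and `late_cut` are hypothetical parameters, the window is defined by probe index, and the average TR50 is taken over all probes in the window; none of these choices is confirmed by the source.

```python
from itertools import groupby


def segregate(probes, half_width=2, min_ratio=1.0, early_cut=4.0, late_cut=6.0):
    """probes: list of (tr50_hours, is_temporally_specific) in genomic order.
    Returns (start_index, end_index, label) intervals."""
    n = len(probes)
    labels = []
    for c in range(n):
        window = probes[max(0, c - half_width):min(n, c + half_width + 1)]
        tsp = sum(1 for _, specific in window if specific)
        tnsp = len(window) - tsp
        # Ratio of TSP to TNSP determines temporal specificity.
        if tsp < min_ratio * tnsp:
            labels.append("TNS")
            continue
        # Average TR50 determines the timing category.
        mean_tr50 = sum(t for t, _ in window) / len(window)
        if mean_tr50 < early_cut:
            labels.append("Early")
        elif mean_tr50 > late_cut:
            labels.append("Late")
        else:
            labels.append("Mid")
    # Collapse runs of equal labels into intervals.
    intervals, start = [], 0
    for label, run in groupby(labels):
        length = len(list(run))
        intervals.append((start, start + length - 1, label))
        start += length
    return intervals
```

For example, `segregate([(2, True)] * 5 + [(8, True)] * 5, half_width=0)` yields one Early interval followed by one Late interval.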

19
Research Plan: Profile Generation
  • Parameters to evaluate
  • Segregation Algorithm: sliding window size,
    minimum probe density
  • Join Intervals: minimum interval size

20
Evaluation
  • Concordance of biological phenomena
  • Segregation intervals ↔ FISH
  • STR50 local minima ↔ Other origin methods
  • Correlation with other biological data
  • Gene density ↔ Early replication
  • AT content ↔ Late replication
  • Gene expression ↔ Early replication
  • Activating acetylation/methylation ↔ Early
    replication
  • Performance on random data
  • Large quantity of TNS replication

21
Research Plan: Replication Origins
  • Drive DNA replication pattern
  • Smoothed TR50 local minima
  • Cleaned up with new profiles
  • Other biological assays
  • Early labeling fragments
  • Nascent strands
  • Bubble trapping
  • ORC binding

22
Approach and Evaluation
  • Correlation between methods
  • Consensus sets
  • Motif analysis
  • Positional attributes
  • Replication timing
  • Proximity to genes
  • Evaluation is difficult (few validated origins)
  • Agreement between methods
  • Testing proposed correlations
  • Paper in preparation

23
Scaling Up to Whole Genome
  • Pilot 1% → 100% of human genome
  • Algorithms developed with scalability in mind
  • Incremental update sliding windows → Linear time
  • Performance-based evaluation
  • If 100% data available
  • Profile multiple runs
  • Else
  • Profile many 1% runs
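
The incremental-update idea can be illustrated with a windowed mean: instead of re-summing all k values for each window, add the value entering the window and subtract the one leaving it. This is a generic sketch of the technique, not the author's code, and it applies only to statistics that can be updated this way (sums, counts, means).

```python
def sliding_means(values, k):
    """Mean of every length-k window, computed in O(n) total time."""
    if k <= 0 or k > len(values):
        return []
    window_sum = sum(values[:k])
    means = [window_sum / k]
    for i in range(k, len(values)):
        # Add the entering value and drop the leaving one.
        window_sum += values[i] - values[i - k]
        means.append(window_sum / k)
    return means


print(sliding_means([1, 2, 3, 4, 5], 2))   # [1.5, 2.5, 3.5, 4.5]
```

Each step does constant work regardless of k, so the total cost is linear in the number of probes rather than O(kn).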

24
Implementation Details
  • Java
  • Class representation of proprietary microarray
    files
  • Algorithms to process raw microarray data
  • Diagnostic tools
  • Perl
  • Scripts to process intermediate and final data
  • Correlations, data transformation, quality
    assurance
  • R statistical language
  • Smoothing, statistical plots, correlation studies
  • Shell scripts
  • Automated processing of microarray sets

25
Current/Expected Contributions
  • Algorithms, Software Infrastructure, Analysis
  • Probe-by-probe TR50 analysis
  • Temporal Specificity Algorithm
  • Combinatorial analysis of allele locations
  • Segregation Algorithm
  • TNS, Early, Mid, Late replicating areas
  • Used to design validation experiments
  • Smoothed TR50 profile
  • Local minima provide candidate origin set
  • Linear algorithms enable scale up
  • Randomness testing

26
Publications
  • Completed
  • ENCODE Project Consortium. The ENCODE
    (ENCyclopedia Of DNA Elements) Project. Science.
    2004 Oct 22; 306(5696): 636-40.
  • ENCODE Project Consortium. Identification and
    analysis of functional elements in 1% of the
    human genome by the ENCODE pilot project.
    Nature. In press, to appear in the June 14,
    2007 issue.
  • Karnani N., Taylor C., Malhotra A., Dutta A.
    Pan-S replication patterns and chromosomal
    domains defined by genome tiling arrays of ENCODE
    genomic areas. Genome Research.
    In press, to appear in the June 2007 issue.
  • UCSC Browser Tracks:
    TR50, Smoothed TR50, Local Minima,
    Segregation
  • In Progress
  • Multi-million dollar NIH grant for scale up to
    full human genome
  • Paper detailing origin methods, correlations, etc.

27
Timeline
  • Spring 2007 (present to June 20)
  • Implement proposed replication profile
    generation algorithms
  • Generate new profiles for existing data and
    evaluate against FISH
  • Collect new origin sets and continue analysis for
    paper completion
  • Summer 2007 (June 21 to September 21)
  • Explore correlations of new profiles with other
    data sets
  • Submit paper to PSB 2008 based on new method and
    results
  • Develop random data sets to test profile
    generation algorithms
  • Fall 2007 (September 22 to December 21)
  • Evaluate performance for scale up to whole
    genome
  • Tie up loose ends and begin writing the
    dissertation
  • Winter 2007-2008 (December 22 to March 19)
  • Finish dissertation and schedule defense before
    May 2008

28
Acknowledgements
  • Advising
  • Anindya Dutta, Gabriel Robins
  • Biological Experiments
  • Neerja Karnani, Patrick Boyle, Larry Mesner,
    Jamie Teer, Hakkyun Kim
  • Collaborative Analysis
  • Ankit Malhotra
  • Discussions of Analysis
  • Stefan Bekiranov

29
THE END
30
Why is this work computer science?
  • Fred Brooks, "The Computer Scientist as Toolsmith
    II"
  • "Hitching our research to someone else's driving
    problems, and solving those problems on the
    owners' terms, leads us to richer computer
    science research."
  • Not an incremental improvement
  • Algorithmic techniques and analysis used to solve
    a problem previously addressed inadequately with
    a statistical approach that performed poorly
  • Collaboration outside of engineering disciplines
    enhances visibility, funding opportunities, and
    demand for CS work
  • Developed algorithms, time complexity analysis,
    combinatorial analysis, feedback to experimental
    design

31
Will this work lead to any CS publications?
  • The Nature article focused on analysis of the
    biological data and includes descriptions of some
    of my algorithms
  • The Genome Research paper and origins paper will
    also contain writeups of my algorithms and
    analysis techniques
  • The Pacific Symposium on Biocomputing focuses on
    algorithms and computational techniques

32
Isn't your approach too simple?
  • The approach isn't simple
  • Combinatorial analysis
  • Temporal specificity algorithm (many iterations)
  • Probewise computation to deal with binding
    affinity
  • Incremental updating sliding windows
  • Cross-hybridization
  • Synchronization error
  • Smoothing
  • Parameterization
  • Linear algorithms for scale up

33
Can't your algorithm be replaced by a well-known
statistical method?
  • HMMs were used for segregation of intervals
  • Performed poorly in comparison to my algorithm
  • Less accurate categorization of replication
    intervals
  • Prone to rapid oscillation, producing tiny
    intervals
  • Parameterization was difficult
  • Lowess smoothing is a statistical method
  • Parameterization was not easy

34
What are the biggest challenges in this work?
  • Noise!
  • The data to analyze comes from biological
    experiments with several sources of noise that
    compound upon one another
  • Biology
  • I haven't had a course in biology since 10th
    grade
  • Microarrays
  • New, evolving technology we're still learning to
    deal with
  • Data size
  • Hundreds of GB of data to process
  • Replicates, failed experiments
  • Algorithms must be efficient

35
What kind of career are you aiming for after
graduation, and why?
  • Teaching Computer Science (Small College)
  • I enjoyed learning in my undergraduate curriculum
    with meaningful interactions with professors
  • I taught Discrete Math at UVa in Fall '02 and
    Spring '03
  • Enjoyable, but 60-70 students is too large
  • Post-doctoral (Biological Computing)
  • Many opportunities around the world
  • Further exploration of the field

36
How will you know when your work/thesis is done?
  • Research is never really done, but you have to
    declare victory at some point
  • The replication profiling algorithms I've
    developed already perform quite well
  • I have concrete plans to improve and finalize
    them