Title: Identifying Structural Variations Using Next Generation Sequencing Data
1Identifying Structural Variations Using Next
Generation Sequencing Data
- Seunghak Lee, Elango Cheran, Michael Brudno
- Department of Computer Science
- University of Toronto
- ISMB 2008 SIG on NGS technologies
2Computational Methods to Detect Structural
Variants
- Direct comparison of genomes (Levy et al. 2007)
  - No limit on resolution
  - Expensive to assemble the whole genome
- Unassembled clone-end data (Tuzun et al. 2005, Korbel et al. 2007)
  - Uses matepairs, part of the data generated for assembling a whole genome
  - Unable to exploit high coverage of reads
- Probabilistic framework for clone-end data (Lee et al. 2008)
  - Based on unassembled clone-end data
  - Takes advantage of high coverage of reads
3Outline
- Probabilistic framework
- New Measure for Clusters
- Results
5What are Matepairs?
[Figure: a DNA fragment with reads ATCAA and CTAAG at its two ends; the insert size is marked]
6Overview of Probabilistic Framework
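- We defined pairwise probabilities (Lee et al. 2008)
  - $P(X_i, X_j \mid \text{ins})$, $P(X_i, X_j \mid \text{del})$, $P(X_i, X_j \mid \text{inv})$, $P(X_i, X_j \mid \text{trans})$
[Figure: matepairs $X_i$ and $X_j$ on the donor genome and their mappings on the reference (REF)]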
8Probabilistic Framework - Insertion
[Figure: matepairs $X_i$ and $X_j$ on the donor and on REF, with an unknown region marked "?"]
10Probabilistic Framework - Insertion
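- Let $s_i$ = mapped distance of $X_i$ in REF
- $m_i$ = insert size of $X_i$
- Estimated size of the insertion explained by $X_i$: $r_i = |s_i - m_i|$
[Figure: $X_i$ and $X_j$ on the donor, with insert size $m_i$, and on REF, with mapped distance $s_i$]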
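The estimate on this slide can be sketched in a few lines. This is only an illustration: `Matepair` and `estimated_insertion_size` are hypothetical names, and the absolute value is an assumption made so that the reported size is non-negative (for an insertion the mapped distance in REF is shorter than the insert size).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Matepair:
    """Hypothetical container for one mapped matepair X_i."""
    mapped_distance: float  # s_i: distance between the two reads in REF
    insert_size: float      # m_i: insert size of the matepair


def estimated_insertion_size(matepair: Matepair) -> float:
    """Return r_i, the size of the insertion explained by this matepair.

    Assumption: r_i is the gap between the insert size and the mapped
    distance, reported as a non-negative size.
    """
    return abs(matepair.mapped_distance - matepair.insert_size)


if __name__ == "__main__":
    x_i = Matepair(mapped_distance=1500.0, insert_size=2000.0)
    print(estimated_insertion_size(x_i))
```

With a mapped distance of 1,500 bp and an insert size of 2,000 bp, the matepair is explained by an insertion of about 500 bp.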
11Probabilistic Framework - Insertion
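[Figure: $X_i$ and $X_j$ around an insertion of size $r$ in the donor, with insert sizes $m_i$ and $m_j$; on REF their mapped distances are $s_i$ and $s_j$]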
12Probabilistic Framework - Insertion
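- $s_i + r \approx$ average insert size of $X_i$
[Figure: $X_i$ and $X_j$ around an insertion of size $r$ in the donor, with the insert size $m_i$ of $X_i$; on REF the mapped distance is $s_i$]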
13Probabilistic Framework - Insertion
$Y$: random variable for the insert size
$(s_i + r)$ close to $\mu$ $\Rightarrow$ more likely an insertion
[Figure: density $p(Y)$, with $s_i + r$ and the average insert size $\mu$ of $X_i$ marked]
15Probabilistic Framework - Insertion
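$(s_i + r)$ close to $\mu$ $\Rightarrow$ more likely an insertion
[Figure: density $p(Y)$, with $s_i + r$ and the average insert size $\mu$ of $X_i$ marked; the gap $\mu - (s_i + r)$ is highlighted]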
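The intuition that a small gap between the mapped distance plus insertion size and the mean insert size makes an insertion more plausible can be illustrated by evaluating an insert-size density at that sum. The sketch below assumes a Gaussian insert-size distribution with mean `mu` and standard deviation `sigma`; the function names are hypothetical, and the slides do not state that the pairwise probabilities of Lee et al. (2008) are computed exactly this way.

```python
import math


def insert_size_log_density(y: float, mu: float, sigma: float) -> float:
    """Log of a Gaussian density p(Y = y) with mean mu and std dev sigma."""
    return -0.5 * math.log(2.0 * math.pi * sigma ** 2) - (y - mu) ** 2 / (2.0 * sigma ** 2)


def insertion_score(s_i: float, r: float, mu: float, sigma: float) -> float:
    """Score a candidate insertion size r for a matepair mapped at distance s_i.

    The score is the log density at s_i + r, so it is highest when
    s_i + r equals the mean insert size mu.
    """
    return insert_size_log_density(s_i + r, mu, sigma)


if __name__ == "__main__":
    mu, sigma = 2000.0, 100.0
    for r in (100.0, 400.0, 500.0):
        print(r, insertion_score(1500.0, r, mu, sigma))
```

Here the candidate with the mapped distance and insertion size summing to the mean receives the highest score, and the score falls off as the gap between them grows.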
16Outline
- Probabilistic framework
- New Measure for Clusters
- Results
17Motivation
- Probabilities were defined by pairwise comparisons
  - High matepair coverage is now available from NGS technologies
  - Expect a large number of matepairs in each cluster
- We developed an alternative measure that takes all matepairs in each cluster into account together
18Kullback-Leibler Divergence Overview
- KL divergence measures the difference between two probability distributions
- KL divergence of Q from P
  - P: true distribution
  - Q: model distribution
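For reference, the standard definition for discrete distributions is the sum over outcomes of P(x) log(P(x) / Q(x)). A minimal sketch of that definition follows; `kl_divergence` is a hypothetical helper name, and the example distributions are made up for illustration.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL divergence of Q from P for two discrete distributions.

    p is the true distribution, q is the model distribution; both must
    have the same length and sum to 1.
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same number of outcomes")
    total = 0.0
    for p_x, q_x in zip(p, q):
        if p_x == 0.0:
            continue  # terms with P(x) = 0 contribute nothing
        if q_x == 0.0:
            return math.inf  # Q assigns zero mass where P does not
        total += p_x * math.log(p_x / q_x)
    return total


if __name__ == "__main__":
    p = [0.5, 0.5]
    q = [0.9, 0.1]
    print(kl_divergence(p, p), kl_divergence(p, q), kl_divergence(q, p))
```

The divergence is zero when the two distributions are identical and is not symmetric, which is why the roles of P (true) and Q (model) are stated explicitly.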
19A New Measure for Clusters (1)
- $p(Y)$: observed distribution of an insert size
- $q(Y)$: $p(Y)$ shifted to have mean $s + r$, where $s$ is the mapped distance and $r$ is the size of the insertion
$(s + r)$ close to $\mu$ $\Rightarrow$ more likely an insertion
$\mathrm{KL}(p \,\|\, q) \approx 0$
[Figure: density $p(Y)$ with mean $\mu$]
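If the observed distribution is taken to be Gaussian with mean µ and standard deviation σ (an assumption here), the shifted copy has the same variance, and the KL divergence between the two reduces to the squared gap between the means divided by 2σ². The sketch below uses that closed form; `kl_shifted_gaussian` is a hypothetical name.

```python
def kl_shifted_gaussian(s: float, r: float, mu: float, sigma: float) -> float:
    """KL(p || q) for p = N(mu, sigma^2) and q = N(s + r, sigma^2).

    For two Gaussians with equal variance the divergence is
    (mu - (s + r))^2 / (2 * sigma^2), so it is zero exactly when
    s + r equals mu.
    """
    if sigma <= 0.0:
        raise ValueError("sigma must be positive")
    gap = mu - (s + r)
    return gap ** 2 / (2.0 * sigma ** 2)


if __name__ == "__main__":
    mu, sigma = 2000.0, 100.0
    print(kl_shifted_gaussian(1500.0, 500.0, mu, sigma))
    print(kl_shifted_gaussian(1500.0, 400.0, mu, sigma))
```

With µ = 2,000 and σ = 100, a mapped distance of 1,500 with r = 500 gives a divergence of 0, while r = 400 leaves a 100 bp gap and a divergence of 0.5.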
20A New Measure for Clusters (2)
- Define a new measure using KL divergence that takes into account all matepairs in cluster $C_k$
  - $p^{(i)}(Y)$: observed distribution of the insert size of the $i$-th matepair in $C_k$
  - $q^{(i)}(Y)$: $p^{(i)}(Y)$ shifted to have mean $s^{(i)} + r$
21Optimizing New Measure
- Assuming $p^{(i)}(Y)$ is a Gaussian with given mean and variance
  - Find the optimum $r$ by minimizing $D_{KL}(C_k)$ with gradient descent
  - The optimized $D_{KL}(C_k)$ is used as an alternative measure for cluster $C_k$
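The optimization step can be sketched as follows. The exact form of the cluster measure is not recoverable from the slide text, so this sketch assumes it is the sum of the per-matepair KL divergences between each Gaussian and its shifted copy. Each matepair is a hypothetical `(s, mu, sigma)` tuple (mapped distance, mean insert size, standard deviation), and the step size is an illustrative choice.

```python
from typing import List, Tuple

Matepair = Tuple[float, float, float]  # (s, mu, sigma), hypothetical layout


def cluster_dkl(r: float, matepairs: List[Matepair]) -> float:
    """Assumed cluster measure: sum of KL(p_i || q_i) for insertion size r."""
    return sum((mu - (s + r)) ** 2 / (2.0 * sigma ** 2) for s, mu, sigma in matepairs)


def cluster_dkl_gradient(r: float, matepairs: List[Matepair]) -> float:
    """Derivative of cluster_dkl with respect to r."""
    return sum(-(mu - (s + r)) / sigma ** 2 for s, mu, sigma in matepairs)


def optimize_insertion_size(matepairs: List[Matepair], r0: float = 0.0,
                            tol: float = 1e-9, max_iters: int = 10_000):
    """Minimize cluster_dkl over r with plain gradient descent."""
    # The objective is quadratic in r with constant second derivative
    # sum(1 / sigma^2); half its inverse is a safe step size.
    curvature = sum(1.0 / sigma ** 2 for _, _, sigma in matepairs)
    learning_rate = 0.5 / curvature
    r = r0
    for _ in range(max_iters):
        step = learning_rate * cluster_dkl_gradient(r, matepairs)
        r -= step
        if abs(step) < tol:
            break
    return r, cluster_dkl(r, matepairs)


if __name__ == "__main__":
    cluster = [(1500.0, 2000.0, 100.0), (1480.0, 2000.0, 100.0), (1530.0, 2000.0, 100.0)]
    print(optimize_insertion_size(cluster))
```

Under these Gaussian assumptions the optimum also has a closed form (an inverse-variance weighted mean of µ − s), which makes a convenient check on the gradient descent result.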
22Use of KL Divergence for Clustering
- We cluster mappings of matepairs using hierarchical clustering
  - Linkage distance between $C_u$ and $C_v$: $D_{KL}(C_u, C_v)$
  - Find the two closest clusters; if $D_{KL}(C_u, C_v) <$ cutoff, merge them
[Figure: clusters $C_u$ and $C_v$ on REF, and the cluster $C_k$]
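The merge loop can be sketched as a simple agglomerative procedure. The linkage formula itself is not recoverable from the slide text, so this sketch assumes the distance between two clusters is the optimized KL measure of their union, computed in closed form under the Gaussian assumption. Matepairs are hypothetical `(s, mu, sigma)` tuples, and the sketch ignores mapping positions on REF, which a real implementation would presumably use to restrict candidate merges.

```python
from typing import List, Tuple

Matepair = Tuple[float, float, float]  # (s, mu, sigma), hypothetical layout


def optimized_dkl(cluster: List[Matepair]) -> float:
    """Assumed cluster measure at its optimal r (closed form for Gaussians)."""
    weights = [1.0 / sigma ** 2 for _, _, sigma in cluster]
    gaps = [mu - s for s, mu, _ in cluster]
    r_opt = sum(w * g for w, g in zip(weights, gaps)) / sum(weights)
    return sum(w * (g - r_opt) ** 2 / 2.0 for w, g in zip(weights, gaps))


def hierarchical_clustering(matepairs: List[Matepair], cutoff: float) -> List[List[Matepair]]:
    """Repeatedly merge the two closest clusters while their distance < cutoff."""
    clusters = [[mp] for mp in matepairs]
    while len(clusters) > 1:
        best = None  # (distance, u, v) of the closest pair found so far
        for u in range(len(clusters)):
            for v in range(u + 1, len(clusters)):
                distance = optimized_dkl(clusters[u] + clusters[v])
                if best is None or distance < best[0]:
                    best = (distance, u, v)
        distance, u, v = best
        if distance >= cutoff:
            break  # even the closest pair is too far apart to merge
        clusters[u] = clusters[u] + clusters[v]
        del clusters[v]
    return clusters


if __name__ == "__main__":
    pairs = [(1500.0, 2000.0, 100.0), (1490.0, 2000.0, 100.0), (1505.0, 2000.0, 100.0),
             (7000.0, 10000.0, 100.0), (6990.0, 10000.0, 100.0)]
    for cluster in hierarchical_clustering(pairs, cutoff=1.0):
        print(cluster)
```

In this toy input, three matepairs imply an insertion of roughly 500 bp and two imply roughly 3,000 bp, so the procedure stops with two clusters.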
23Outline
- Probabilistic Framework
- New Measure for Clusters
- Results
24Clustering Results
- We started with 170,000,000 matepairs
  - 34.8% were uniquely mapped (93% for Sanger reads)
  - 97.5% had a concordant position
- Through the clustering procedure we found (FDR 0.05)
  - 152 insertion clusters (4 had a uniquely mapped matepair)
  - 930 deletion clusters (43)
  - 55 inversion clusters (1)
  - 19 translocation (cross-chromosome) clusters (all were required to have a uniquely mapped read)
25Changes in Coverage of Deletion Clusters
- Coverage of deletions
  - Homozygous: no reads at the locus of the deletion
  - Heterozygous: half the reads at the locus of the deletion
[Figure: two panels, both labeled Chr1]
26Example of Deletion Cluster
- A deletion cluster in chr1:1,213,000-1,215,000
  - Contains 16 discordant matepairs
  - Overlaps a deletion in Levy et al. (2007)
  - Size of deletion: 2,000 bp
27Distribution of Coverage of Deletions (1)
- $r_1$, $r_2$, $r_3$
- $C(r_i)$ is the number of reads in region $r_i$
- If this is a real deletion, coverage in $r_2$ should be less than the average coverage of $r_1$ and $r_3$
- We computed $F$, the ratio of the coverage of the deletion to that of its neighbors
[Figure: regions $r_1$, $r_2$, $r_3$ on REF, with the deletion at $r_2$]
$F = 0$ (homozygous), $F = 0.5$ (heterozygous)
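The coverage check can be sketched as a ratio of read counts. This assumes the three regions have equal length, so that raw read counts are comparable, and that F is the count in the deleted region divided by the mean count of its two neighbors. The function names are hypothetical, and the default threshold of 0.7 is the cutoff reported with the results.

```python
def coverage_fraction(c_r1: float, c_r2: float, c_r3: float) -> float:
    """Return F = C(r2) / mean(C(r1), C(r3)).

    c_r2 is the read count in the putative deletion; c_r1 and c_r3 are
    the read counts in the neighboring regions (assumed equal in length).
    """
    neighbor_mean = (c_r1 + c_r3) / 2.0
    if neighbor_mean <= 0.0:
        raise ValueError("neighboring regions have no reads; F is undefined")
    return c_r2 / neighbor_mean


def supports_deletion(f: float, threshold: float = 0.7) -> bool:
    """True if the coverage fraction is below the chosen threshold."""
    return f < threshold


if __name__ == "__main__":
    for counts in [(100, 0, 100), (100, 50, 100), (100, 95, 110)]:
        f = coverage_fraction(*counts)
        print(counts, f, supports_deletion(f))
```

A homozygous deletion is expected to give F near 0 and a heterozygous one F near 0.5, so both fall below the 0.7 threshold, while a region with normal coverage does not.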
28Distribution of Coverage of Deletions (2)
[Figure: distribution of $F$; x-axis: $F$, y-axis: number of clusters]
- 699/930 (75%) of deletions have less than 0.7 of the read coverage of their neighbor regions ($F < 0.7$)
29Agreement with Previous Results
- All of the correlations are significant (p-values < 0.001 via Monte Carlo)
- Found 444 novel structural variants
30Conclusions
- Used a probabilistic framework for finding structural variants with NGS data
  - Isolated insertions, deletions, inversions and translocations between the public human reference genome and AB-SOLiD data
  - 75% of deletions are supported by low read coverage
  - Strong correlation between our results and previous studies
31Acknowledgments
- Michael Brudno
- Elango Cheran
- Francisco de la Vega
- Andy Pang
- Lars Feuk
- Stephen Scherer
- Rest of UofT DSC CompBio Group
Applied Biosystems
Sick Kids Hospital