Identifying Structural Variations Using Next Generation Sequencing Data



1
Identifying Structural Variations Using Next
Generation Sequencing Data
  • Seunghak Lee, Elango Cheran, Michael Brudno
  • Department of Computer Science
  • University of Toronto
  • ISMB 2008 SIG on NGS technologies

2
Computational Methods to Detect Structural
Variants
  • Direct comparison of genomes (Levy et al. 2007)
      - No limit on resolution
      - Expensive to assemble the whole genome
  • Unassembled clone-end data (Tuzun et al. 2005,
    Korbel et al. 2007)
      - Uses matepairs, part of the data for assembling
        the whole genome
      - Unable to exploit high coverage of reads
  • Probabilistic framework for clone-end data (Lee
    et al. 2008)
      - Based on unassembled clone-end data
      - Takes advantage of high coverage of reads

3
Outline
  • Probabilistic framework
  • New Measure for Clusters
  • Results

5
What are Matepairs?
[Figure: a DNA fragment with end reads ATCAA and CTAAG; the span between them is labeled "Insert size"]
6
Overview of Probabilistic Framework
  • We defined pairwise probabilities (Lee et al.
    2008)
  • P(X_i, X_j | ins), P(X_i, X_j | del), P(X_i, X_j | inv),
    P(X_i, X_j | trans)

[Figure: matepairs X_i and X_j shown on Donor and on REF]
8
Probabilistic Framework - Insertion
[Figure: matepairs X_i and X_j shown on Donor and on REF, with a "?" marking the unknown insertion]
10
Probabilistic Framework - Insertion
  • Let s_i = mapped distance of X_i in REF
  • m_i = insert size of X_i
  • Estimated size of insertion explained by X_i is
    r_i = |s_i - m_i|

[Figure: matepairs X_i and X_j on Donor and REF; m_i is marked on Donor and s_i on REF]
11
Probabilistic Framework - Insertion
  • Size of insertion

[Figure: matepairs X_i and X_j spanning an insertion of size r in Donor; m_i and m_j are marked on Donor, s_i and s_j on REF]
12
Probabilistic Framework - Insertion
  • s_i + r = average insert size of X_i

[Figure: matepairs X_i and X_j spanning an insertion of size r in Donor; the insert size of X_i (m_i) is marked on Donor and s_i on REF]
13
Probabilistic Framework - Insertion
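  • Now, we define the insertion probability in terms
    of p(Y), with Y = random variable for insert size
  • (s_i + r) is close to µ => more likely insertion

[Figure: density p(Y), with s_i + r and µ (average insert size of X_i) marked on the axis]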
15
Probabilistic Framework - Insertion
  • Shaded area
  • (s_i + r) is close to µ => more likely insertion

[Figure: density p(Y) with a shaded area; s_i + r, µ (average insert size of X_i), and the distance µ - (s_i + r) are marked]
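
One way to read the shaded area is as a tail probability of p(Y). The sketch below assumes p(Y) is Gaussian with mean µ and standard deviation sigma, and that the shaded area is the two-sided tail lying at least as far from µ as s_i + r. Both the function name and the two-sided choice are assumptions made for illustration; the slides do not preserve the exact definition.

```python
import math


def insertion_tail_probability(s_i, r, mu, sigma):
    """Two-sided Gaussian tail area beyond the point s_i + r.

    Assumption: p(Y) ~ Normal(mu, sigma^2). The value is 1.0 when
    s_i + r equals mu and shrinks as s_i + r moves away from mu.
    """
    z = abs(mu - (s_i + r)) / sigma
    # P(|Z| >= z) for a standard normal Z equals erfc(z / sqrt(2)).
    return math.erfc(z / math.sqrt(2.0))
```

Under this reading, a candidate insertion size r that brings s_i + r close to µ gets a probability near 1, matching the slide's "more likely insertion".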
16
Outline
  • Probabilistic framework
  • New Measure for Clusters
  • Results

17
Motivation
  • Probabilities were defined by pairwise
    comparisons
  • High coverage of matepairs is now available from
    NGS technologies
  • Expect a large number of matepairs in each cluster
  • We developed an alternative measure which takes
    into account all matepairs in each cluster
    together

18
Kullback-Leibler Divergence Overview
  • KL divergence is a measure of the difference of
    two probability distributions
  • KL divergence of Q from P: D_KL(P || Q)
  • P: true distribution
  • Q: model distribution
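
For reference, the standard discrete form is D_KL(P || Q) = sum over x of P(x) log(P(x) / Q(x)). A minimal Python sketch follows; the function name is illustrative, and it assumes Q(x) > 0 wherever P(x) > 0.

```python
import math


def kl_divergence(p, q):
    """D_KL(P || Q) for two discrete distributions given as sequences.

    Terms with P(x) == 0 contribute nothing, by the usual convention.
    Assumes q[x] > 0 wherever p[x] > 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

The measure is zero when the two distributions coincide and is not symmetric: swapping P and Q generally gives a different value.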

19
A New Measure for Clusters (1)
  • p(Y): observed distribution of an insert size
  • q(Y): shifted distribution of p(Y) with mean of
    s + r, where s is mapped distance and r is size of
    insertion
  • (s + r) is close to µ => more likely insertion,
    and KL(p || q) ≈ 0

[Figure: density p(Y) centered at µ]
20
A New Measure for Clusters (2)
  • Define a new measure using KL divergence that
    takes into account all matepairs in cluster Ck
  • p(i)(Y): observed distribution of insert size of
    i-th matepair in Ck
  • q(i)(Y): shifted distribution of p(i)(Y) with
    mean of s(i) + r

21
Optimizing New Measure
  • Assuming p(i)(Y) is a Gaussian with given mean
    and variance
  • Find the optimum r by minimizing D_KL(Ck) using
    the gradient descent optimization method
  • The optimized D_KL(Ck) is used as an alternative
    measure for a cluster Ck
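
The optimization step can be sketched as follows, under two assumptions that the transcript does not confirm: the cluster measure is the sum of KL(p(i) || q(i)) over the matepairs in Ck, and each q(i) keeps the variance of p(i). For two Gaussians with equal variance, the KL divergence reduces to (µ_i - (s_i + r))^2 / (2 sigma_i^2). All function and parameter names are hypothetical.

```python
def cluster_kl(r, mapped, means, sigmas):
    """Assumed cluster measure: sum of equal-variance Gaussian KL terms."""
    return sum((mu - (s + r)) ** 2 / (2.0 * sd ** 2)
               for s, mu, sd in zip(mapped, means, sigmas))


def optimize_r(mapped, means, sigmas, steps=200):
    """Minimize cluster_kl over r with plain gradient descent.

    The step size is half the inverse of the (constant) second
    derivative, so the error halves on every iteration.
    """
    curvature = sum(1.0 / sd ** 2 for sd in sigmas)
    learning_rate = 0.5 / curvature
    r = 0.0
    for _ in range(steps):
        # d/dr of each term is (s + r - mu) / sd^2.
        grad = sum((s + r - mu) / sd ** 2
                   for s, mu, sd in zip(mapped, means, sigmas))
        r -= learning_rate * grad
    return r, cluster_kl(r, mapped, means, sigmas)
```

Because this assumed objective is quadratic in r, the minimizer also has a closed form (a precision-weighted mean of µ_i - s_i); gradient descent is shown only because the slide names it.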

22
Use of KL Divergence for Clustering
  • We cluster mappings of matepairs using
    Hierarchical Clustering
  • Linkage distance between Cu and Cv: D_KL(Cu, Cv)
  • Find the two closest clusters; if D_KL(Cu, Cv) <
    cutoff, then merge them

[Figure: clusters Cu, Ck, and Cv positioned along REF]
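
The merge loop can be sketched as below. The definition of the linkage distance is not preserved in the transcript, so this sketch assumes D_KL(Cu, Cv) is the optimized cluster measure of the merged cluster, using a sum of equal-variance Gaussian KL terms. The names, the tuple layout (s, mu, sd), and that linkage definition are all hypothetical.

```python
def optimal_dkl(cluster):
    """Assumed measure of a cluster at its best-fitting size r.

    cluster: list of (s, mu, sd) tuples, one per matepair.
    """
    weights = [1.0 / sd ** 2 for _, _, sd in cluster]
    # Closed-form minimizer of the quadratic objective.
    r = sum(w * (mu - s) for w, (s, mu, _) in zip(weights, cluster)) / sum(weights)
    return sum(w * (mu - (s + r)) ** 2 / 2.0
               for w, (s, mu, _) in zip(weights, cluster))


def agglomerate(matepairs, cutoff):
    """Repeatedly merge the two closest clusters while below cutoff."""
    clusters = [[m] for m in matepairs]
    while len(clusters) > 1:
        dist, u, v = min(
            (optimal_dkl(clusters[u] + clusters[v]), u, v)
            for u in range(len(clusters))
            for v in range(u + 1, len(clusters))
        )
        if dist >= cutoff:
            break
        merged = clusters[u] + clusters[v]
        clusters = [c for k, c in enumerate(clusters) if k not in (u, v)]
        clusters.append(merged)
    return clusters
```

This version compares every pair of clusters on each pass, which is quadratic per merge; a practical implementation would likely restrict candidates to clusters that are close on the reference.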
23
Outline
  • Probabilistic Framework
  • New Measure for Clusters
  • Results

24
Clustering Results
  • We started with 170,000,000 matepairs
  • 34.8% were uniquely mapped (93% for Sanger
    reads)
  • 97.5% had a concordant position
  • Through the clustering procedure we found (FDR =
    0.05)
  • 152 Insertion clusters (4 had a uniquely mapped
    matepair)
  • 930 Deletion clusters (43)
  • 55 Inversion clusters (1)
  • 19 Translocation (cross-chromosome) clusters
    (all were required to have a uniquely mapped read)

25
Changes in Coverage of Deletion Clusters
  • Coverage of deletions
  • Homozygous: no reads at locus of deletion
  • Heterozygous: half the reads at locus of deletion

[Figure: two read-coverage panels along Chr1]
26
Example of Deletion Cluster
  • A deletion cluster in chr1:1,213,000-1,215,000
  • Contains 16 discordant matepairs
  • Overlaps a deletion in Levy et al. (2007)
  • Size of deletion: 2,000 bp

27
Distribution of Coverage of Deletions (1)
  • |r1| = |r2| = |r3|
  • C(ri) is the number of reads in region ri
  • If this is a real deletion, coverage in r2 should
    be less than the average coverage of r1 and r3
  • We computed the fraction F of coverage of the
    deletion relative to its neighbors
  • F = 0 (Homozygous), F = 0.5 (Heterozygous)

[Figure: REF with the deletion in region r2, flanked by regions r1 and r3]
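
The coverage fraction can be sketched as follows, assuming F is the read count in r2 divided by the mean read count of r1 and r3. The exact normalization used in the talk is not preserved, and the function name is illustrative.

```python
def coverage_fraction(c_r1, c_r2, c_r3):
    """Assumed form of F: coverage in r2 relative to its neighbors.

    c_r1, c_r2, c_r3 are the read counts C(r1), C(r2), C(r3).
    """
    neighbor_mean = (c_r1 + c_r3) / 2.0
    if neighbor_mean == 0:
        raise ValueError("no reads in the neighboring regions")
    return c_r2 / neighbor_mean
```

With this form, a region with no reads gives F = 0 (the homozygous case) and a region with half the neighboring coverage gives F = 0.5 (the heterozygous case).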
28
Distribution of Coverage of Deletions (2)
[Figure: histogram of F (x-axis) against number of clusters (y-axis)]
  • 699/930 (75%) of deletions have less than 0.7 of
    the read coverage of their neighbor regions (F < 0.7)

29
Agreement with Previous Results
  • All of the correlations are significant
    (p-values < 0.001 via Monte Carlo)
  • Found 444 novel structural variants

30
Conclusions
  • Used a probabilistic framework for finding
    structural variants with NGS data
  • Isolated insertions, deletions, inversions and
    translocations between the public reference human
    genome and AB-SOLiD data
  • 75% of deletions supported by low read coverage
  • Strong correlation between our results and
    previous studies

31
Acknowledgments
  • Michael Brudno
  • Elango Cheran
  • Francisco de la Vega
  • Andy Pang
  • Lars Feuk
  • Stephen Scherer
  • Rest of UofT DSC CompBio Group

Applied Biosystems
Sick Kids Hospital