1
SNP and mutation discovery using base-specific
cleavage and MALDI-TOF mass spectrometry

Sebastian Böcker, Bioinformatics (2003) 19,
i44-i53
Presented by Mark Chaisson, Bioinformatics

2
Outline

Goals
Data acquisition
Data types (compomers)
Operations
Methods
Results
Discussion

3
Goal: SNP and mutation discovery

Biological goal: discover which SNPs are present
in a sequence with a known reference strand, in
regions that may have clusters of SNPs.
Computational goal: find the (minimal) changes
necessary to transform a reference spectrum
into a measured one.

4
Data acquisition

Target (sample) DNA strand: 100-1000 nt,
amplified by PCR.
Homozygous with respect to sample (all SNPs
belong to the same strand).
Base-specific cleavage through RNase A, and
selective transcription of the forward or reverse
strand of the sample DNA.
This method has been submitted for publication.

5
SEQUENOM MassARRAY

SNP discovery
SNP genotyping
Allele frequency
Gene expression
...

6
Data acquisition

Total digestion

7
Data acquisition: annotated mass spectrum from T1
digestion

8
Data acquisition: peak selection

An in-silico prediction of the template spectrum (no SNP)
is compared with the measured spectrum.
Differences between the sample spectrum and the predicted
spectrum are attributed to mutations.
Mass gives information about nucleotide
counts, but not about their order.

9
Data types: compomers

s: sequence (e.g. TGGTCACT)
c: compomer, the nucleic acid composition of a
sequence
c = comp(s) = (A_i, C_j, T_k, G_l),
e.g. comp(TGGTCACT) = G2C2A1T3
A_i: i is the number of As; |c|: net size of c (the sum of the counts)
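
As a small illustration of the definition, a compomer can be represented as a mapping from base to count. The function names `comp` and `size` below are chosen for this sketch and are not taken from the paper or the slides.

```python
from collections import Counter

def comp(s: str) -> dict[str, int]:
    """Return the compomer of s as a mapping from base to count."""
    counts = Counter(s)
    # Counter returns 0 for bases that do not occur in s.
    return {base: counts[base] for base in "ACGT"}

def size(c: dict[str, int]) -> int:
    """|c|: the sum of all counts in the compomer."""
    return sum(c.values())

c = comp("TGGTCACT")
print(c)        # {'A': 1, 'C': 2, 'G': 2, 'T': 3}
print(size(c))  # 8
```

The order of the bases is discarded on purpose: two different sequences with the same base counts map to the same compomer.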

10
Operations: string, compomer spectra

Given a sample string s and a cut string x:
string spectrum S0(s): the set of substrings of s that are
bounded by x or by the ends of s, and do not
contain x.
compomer spectrum C0(s): the compomers of the strings in S0(s).
Example: s = ACATGTGCCATTA, x = T
S0 = {ACA, G, GCCA, ε, A}
C0 = {A2C1, G1, G1C2A1, A1}
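
The two spectra can be sketched in a few lines, assuming a single-nucleotide cut string so that plain `str.split` performs the cleavage. The function names and the tuple encoding of compomers are choices made for this sketch only.

```python
from collections import Counter

def string_spectrum(s: str, x: str) -> set[str]:
    """S0(s): fragments obtained by cutting s at every occurrence of x."""
    return set(s.split(x))

def compomer_spectrum(s: str, x: str) -> set[tuple[tuple[str, int], ...]]:
    """C0(s): compomers of the fragments, as sorted (base, count) tuples."""
    return {tuple(sorted(Counter(y).items())) for y in string_spectrum(s, x)}

print(sorted(string_spectrum("ACATGTGCCATTA", "T")))
# ['', 'A', 'ACA', 'G', 'GCCA']
```

The empty string in the output is the empty fragment between the two adjacent Ts; in this encoding its compomer is the empty tuple.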

11
Operations: order of a string

ord_x(s): the number of times x appears as a
substring of s.
Most of the time x is a single nucleotide
(base-specific endonuclease).

12
Operations: count operators

define c≥0(σ) = max{c(σ), 0}
e.g. a count of 3 gives 3; a count of -1 gives 0
define c≤0(σ) = min{c(σ), 0}
e.g. a count of 3 gives 0; a count of -1 gives -1
Negative counts arise in differences such as (c - c')
Natural compomer: c = c≥0
comp(s) is always natural
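
A minimal sketch of the two count operators, with compomers held as plain dictionaries. The helper names `pos_part`, `neg_part` and `is_natural` are invented here for illustration.

```python
def pos_part(c: dict[str, int]) -> dict[str, int]:
    """c>=0: keep positive counts, replace negative counts with 0."""
    return {base: max(n, 0) for base, n in c.items()}

def neg_part(c: dict[str, int]) -> dict[str, int]:
    """c<=0: keep negative counts, replace positive counts with 0."""
    return {base: min(n, 0) for base, n in c.items()}

def is_natural(c: dict[str, int]) -> bool:
    """A compomer is natural when it equals its own positive part."""
    return c == pos_part(c)

# Difference of A1C2 and A1C1G1: one C too many, one G too few.
diff = {"A": 0, "C": 1, "G": -1, "T": 0}
print(pos_part(diff), neg_part(diff), is_natural(diff))
```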

13
Operations: distance

When comparing strings (fragments): dL(s, s'), the
Levenshtein (edit) distance
When comparing compomers: d(c, c') = max{ |(c - c')≥0|,
|(c - c')≤0| }, where |.| sums the absolute values of the counts
Note: fragments y, y' with compomers c, c' have
dL(y, y') ≥ d(c, c')
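
The relation between the two distances can be checked on small cases. The sketch below assumes the compomer distance is the larger of the total surplus and the total deficit of counts, and uses a textbook dynamic-programming Levenshtein distance; neither function is taken from the paper's implementation.

```python
from collections import Counter

BASES = "ACGT"

def d(c, c2):
    """Compomer distance: the larger of total surplus and total deficit."""
    diff = [c[b] - c2[b] for b in BASES]
    surplus = sum(n for n in diff if n > 0)
    deficit = -sum(n for n in diff if n < 0)
    return max(surplus, deficit)

def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost substitution, insertion and deletion."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]

for y, y2 in [("ACC", "ACG"), ("ACC", "ACCGA"), ("AC", "CA")]:
    print(y, y2, d(Counter(y), Counter(y2)), levenshtein(y, y2))
```

The pair AC and CA shows why the bound is not an equality: the compomers are identical, so the compomer distance is 0, while the edit distance is 2.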

14
Distance between compomers can characterize edit
operations: c - c'

15
Operations: SNP fragment explanation

Given a fragment y ∈ Σ* and a compomer c' from
a different (unknown) fragment y', let c =
comp(y).
E(y, c, c'): function that generates
all fragments y' with comp(y') = c' and
dL(y, y') = d(c, c').
Uses the parts (c - c')≥0 and (c - c')≤0 to
determine which operations to perform on y.
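
To make the role of E concrete, here is a brute-force stand-in, not the construction used in the paper: it explores every string within d(c, c') single edits of y and keeps those whose compomer is c'. Because the edit distance can never be smaller than the compomer distance, every string kept this way is at edit distance exactly d(c, c'). All names are hypothetical.

```python
from collections import Counter

BASES = "ACGT"

def key(c):
    """Counts in a fixed base order, so compomers compare reliably."""
    return tuple(c[b] for b in BASES)

def neighbours(y: str) -> set[str]:
    """All strings one substitution, insertion or deletion away from y."""
    out = set()
    for i in range(len(y) + 1):
        for b in BASES:
            out.add(y[:i] + b + y[i:])           # insertion before position i
            if i < len(y):
                out.add(y[:i] + b + y[i + 1:])   # substitution at position i
        if i < len(y):
            out.add(y[:i] + y[i + 1:])           # deletion at position i
    out.discard(y)
    return out

def compomer_distance(c, c2) -> int:
    diff = [a - b for a, b in zip(key(c), key(c2))]
    return max(sum(n for n in diff if n > 0), -sum(n for n in diff if n < 0))

def explain(y: str, c_target: dict[str, int]) -> set[str]:
    """Strings y' with comp(y') = c' reachable from y in d(c, c') edits."""
    target = Counter(c_target)
    k = compomer_distance(Counter(y), target)
    reached = {y}
    for _ in range(k):
        reached |= {z for w in reached for z in neighbours(w)}
    return {z for z in reached if key(Counter(z)) == key(target)}

print(sorted(explain("ACC", {"A": 1, "C": 1, "G": 1})))  # ['ACG', 'AGC']
```

This enumeration grows quickly with the distance, so it is only usable for the small distances the talk is concerned with.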

16
Operations: bound determination

Given a string s, a cut string x, and two indices
i, j with 1 ≤ i ≤ j ≤ |s|:
b_{s,x}(i, j) = {L : s is not left bounded at i}
∪ {R : s is not right bounded at j}
s is left (right) bounded by x if the nucleotide
to the left (right) is x, or the end of the
string.
Example: s = GATACC, x = C
b_{s,x}(2, 3) = {L, R}
b_{s,x}(2, 4) = {L}
b_{s,x}(1, 4) = ∅
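
The bound function can be written directly from the definition. This sketch assumes a single-nucleotide cut string and keeps the slide's 1-based, inclusive indices; the function name `bounds` is an invention of this example.

```python
def bounds(s: str, x: str, i: int, j: int) -> frozenset[str]:
    """b_{s,x}(i, j) for 1-based inclusive indices and a one-nucleotide x."""
    b = set()
    # Not left bounded: there is a base left of position i and it is not x.
    if i > 1 and s[i - 2] != x:
        b.add("L")
    # Not right bounded: there is a base right of position j and it is not x.
    if j < len(s) and s[j] != x:
        b.add("R")
    return frozenset(b)

for i, j in [(2, 3), (2, 4), (1, 4)]:
    print((i, j), sorted(bounds("GATACC", "C", i, j)))
```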

17
Operations: set of bounded strings, compomers

Given a string s and a cut string x, define
SB(s, x) = { (s_{i,j}, b_{s,x}(i, j)) : 1 ≤ i ≤ j ≤ |s| }
CB(s, x) = { (comp(y), b) : (y, b) ∈ SB(s, x) }

18
Operations: set of bounded strings

s = ATTCA

19
Operations: set of bounded compomers

s = ATTCA

20
Operations: set of bounded compomers

s = ATTCA

21
Operations: distance with boundaries

D(c, b, c') = d(c, c') + |b|
Includes a boundary term in the distance
operator.
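
A short sketch of the bounded distance, assuming D adds the number of unbounded sides |b| to the compomer distance. The two calls use the TACCTAT cases that appear later in the talk; the function names are chosen for this example.

```python
BASES = "ACGT"

def d(c: dict[str, int], c2: dict[str, int]) -> int:
    """Compomer distance: the larger of total surplus and total deficit."""
    diff = [c.get(b, 0) - c2.get(b, 0) for b in BASES]
    return max(sum(n for n in diff if n > 0), -sum(n for n in diff if n < 0))

def D(c: dict[str, int], b: frozenset, c2: dict[str, int]) -> int:
    """Bounded distance: compomer distance plus one per unbounded side."""
    return d(c, c2) + len(b)

# TACCTAT -> TACCGAT: the fully bounded ACCTA explains the new fragment ACCGA.
print(D({"A": 2, "C": 2, "T": 1}, frozenset(), {"A": 2, "C": 2, "G": 1}))
# TACCTAT -> TATCTAT: the A at position 2 is not right bounded in the reference.
print(D({"A": 1}, frozenset({"R"}), {"A": 1}))
```

In both cases the bounded distance is 1, which matches the single substitution that separates the two strings.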

22
Methods: SNP discovery from mass spectrometry
problem

Given a reference string s and a cut string x:
for a compomer c', find all s'
satisfying c' ∈ C0(s') such that dL(s, s') is
minimal.

23
Methods: SNP discovery, trivial approach

Simulate the mass spectra for all potential
sequence variations of the reference sequence,
and compare them to the measured spectra.
Runtime is feasible for a single base substitution,
insertion, or deletion.

24
SNP discovery: presented approach

Input: s, x, a compomer c', and a maximal
(compomer) cost k; n = |s|
Preprocessing:
set IBC = all indexed bounded
compomers (c, b, i, j) for 1 ≤ i ≤ j ≤ n
subject to ord_x(s_{i,j}) + |b| ≤ k
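
The preprocessing step can be sketched as a direct enumeration of all substrings, which is quadratic in n. The sketch assumes a single-nucleotide cut string, and the function name `build_ibc` and the tuple layout are illustrative only.

```python
from collections import Counter

def build_ibc(s: str, x: str, k: int):
    """All (compomer, b, i, j) with ord_x(s[i..j]) + |b| <= k (1-based i, j)."""
    n = len(s)
    ibc = []
    for i in range(1, n + 1):
        for j in range(i, n + 1):
            y = s[i - 1:j]
            b = set()
            if i > 1 and s[i - 2] != x:   # not left bounded
                b.add("L")
            if j < n and s[j] != x:       # not right bounded
                b.add("R")
            # For a one-nucleotide x, ord_x(y) is the number of x in y.
            if y.count(x) + len(b) <= k:
                c = tuple(sorted(Counter(y).items()))
                ibc.append((c, frozenset(b), i, j))
    return ibc

ibc = build_ibc("ATGTGTCC", "T", k=2)
print(len(ibc), any((i, j) == (3, 7) for _, _, i, j in ibc))
```

For s = ATGTGTCC and x = T, the substring GTGTC at positions 3 to 7 contains two Ts and is not right bounded, so its cost is 3 and it is left out when k = 2.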

25
SNP discovery: presented approach

Input: s, x, a compomer c', and a maximal
(compomer) cost k; n = |s|
Preprocessing:
set IBC = all indexed bounded
compomers (c, b, i, j) for 1 ≤ i ≤ j ≤ n
subject to ord_x(s_{i,j}) + |b| ≤ k
Example: s = ATGTGTCC, x = T
GTGTC is dropped from IBC when k = 2

26
SNP discovery: presented approach

Input: s, x, a compomer c', and a maximal
(compomer) cost k; n = |s|
Preprocessing:
AGCGTTA
AG, GC, CG, GT, ...

Band n = 2 (substrings of length 2)

27
SNP discovery: presented approach

Input: s, x, a compomer c', and a maximal
(compomer) cost k; n = |s|
Preprocessing:
AGCGTTA
AGC G2CG ...

Band n = 3 (substrings of length 3)

28
SNP discovery: presented approach

1. Discover a peak in the measured spectrum M' that is not in the reference spectrum M

29
SNP discovery: presented approach

For a new peak p:
Build a list of compomers c'1, c'2, ... whose mass
explains the peak.
e.g. m(CCGGG) ≈ m(AAAAA)
For each compomer c'i:
find the minimum distance k by looking in a band of
IBC
e.g. c'1 = G2TC, |c'1| = 4: look in table bands n = 3..5
find all bounded compomers (c, b) such that D(c, b, c'i) = k;
call this set Ce
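
The first step, listing the compomers whose mass explains a peak, can be sketched as a bounded enumeration. The masses below are approximate average masses of DNA nucleotide residues and serve only as illustrative placeholders; the actual assay measures RNA cleavage products, whose masses differ. The tolerance, the count limit and the function names are likewise assumptions of this sketch.

```python
from itertools import product

# Approximate average residue masses in Da (placeholder values, see lead-in).
MASS = {"A": 313.21, "C": 289.18, "G": 329.21, "T": 304.20}

def mass(c: dict[str, int]) -> float:
    """Mass of a compomer as the weighted sum of its residue masses."""
    return sum(MASS[base] * n for base, n in c.items())

def compomers_for_peak(peak: float, max_count: int = 5, tol: float = 0.5):
    """All compomers with at most max_count of each base within tol of peak."""
    hits = []
    for counts in product(range(max_count + 1), repeat=4):
        c = dict(zip("ACGT", counts))
        if abs(mass(c) - peak) <= tol:
            hits.append(c)
    return hits

for c in compomers_for_peak(1566.0):
    print(c, round(mass(c), 2))
```

With these placeholder masses, A5 and C2G3 fall within 0.1 Da of each other, which is why one peak can have several compomer explanations and each of them has to be followed up.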

30
SNP discovery: presented approach

For each compomer ce ∈ Ce:
generate all strings y' such that dL(ye, y') =
D(ce, b, c'), where comp(ye) = ce.
Generate a new string se in which ye is replaced
by y'.
Store the string in a list S.
This generates a list of potential new
strings with SNPs and mutations.

31
SNP discovery: presented approach

Score all potential new strings according to the
mass spectra they would create using all 4
base-specific endonucleases
(4 separate reactions).
Rank strings according to the number of peaks
they correctly reproduce.
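
A toy version of the ranking step is sketched below. It treats compomers as stand-ins for peaks and counts, per cut base, how many measured compomers a candidate reproduces; real scoring would work on masses and intensities. The function names and the shape of the `measured` input are assumptions of this sketch.

```python
from collections import Counter

def compomer_spectrum(s: str, x: str):
    """Compomers of the non-empty fragments after cutting s at every x."""
    return {tuple(sorted(Counter(y).items())) for y in s.split(x) if y}

def score(candidate: str, measured: dict) -> int:
    """Measured compomers reproduced by the candidate over the 4 reactions."""
    return sum(len(compomer_spectrum(candidate, x) & measured[x]) for x in "ACGT")

def rank(candidates: list[str], measured: dict) -> list[str]:
    """Candidates ordered from most to fewest reproduced compomers."""
    return sorted(candidates, key=lambda s: score(s, measured), reverse=True)

# Pretend the sample is TACGTAT and simulate its four compomer spectra.
measured = {x: compomer_spectrum("TACGTAT", x) for x in "ACGT"}
print(rank(["TACCTAT", "TACGTAT", "TACCGAT"], measured))
```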

32
SNP discovery: motivation

If one SNP occurs in the middle of
a fragment y, then d(c, c') = 1, where c = comp(y).
Example: TACCTAT {A1C2, A1} → TACGTAT {A1C1G1, A1}
It is common that a SNP will mutate the
cut string, resulting in compomers with d(c, c') >> k.
Example: TACCTAT {A1C2, A1} → TACCGAT {A2C2G1}

33
SNP discovery: preprocessing motivation

Preprocessing step: set IBC = all indexed
bounded compomers (c, b, i, j) for 1 ≤ i ≤ j ≤ n
subject to ord_x(s_{i,j}) + |b| ≤ k
Example: s = TACCTAT, x = T; IBC contains (shown as strings)
(T, {R}, 1, 1), (TA, {R}, 1, 2),
(TAC, {R}, 1, 3), (TACC, ∅, 1, 4), ...
In case s is mutated at the cut string, there will be
a bounded compomer to compare it to:
s = TACCTAT, s' = TACCGAT: y = (ACCTA, ∅, 2, 6),
c = A2C2T1, b = ∅
s = TACCTAT, s' = TATCTAT: y = (A, {R}, 2, 2), c = A1,
b = {R}

34
SNP discovery: processing motivation

The processing step is based on the theorem:
Given sample strings s, s' ∈ Σ* with dL(s, s') ≤
k, a cut string x ∈ Σ*, and a compomer c' ∈
C0(s', x), there exists a bounded compomer
(c, b) ∈ C^k_B(s, x) such that D(c, b, c') ≤ dL(s,
s').
One of the bounded compomers found in the
preprocessing step will be at minimal distance from
c'.

35
SNP discovery: minimum k

Based on the notion of independent SNPs.
Independent SNPs: SNPs (mutations) that are
separated by a cut string x.
If a DNA strand has i.i.d. nucleotides, on
average every 7 1/3 nucleotides has all 4
present.
In case studies, there was on average one SNP every 231 base
pairs, with a minimal distance of 14.
k = number of mutations; k = 1 or k = 2 is sufficient.
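
The slide does not say where the 7 1/3 figure comes from. One reading that reproduces it, offered here as an assumption, is a coupon-collector calculation with four equally likely nucleotides: seeing all four takes 25/3 draws on average, so after any given nucleotide about 22/3 = 7 1/3 more are needed.

```python
from fractions import Fraction

def expected_draws_to_see_all(n_symbols: int) -> Fraction:
    """Coupon-collector expectation for equally likely symbols."""
    return sum(Fraction(n_symbols, i) for i in range(1, n_symbols + 1))

total = expected_draws_to_see_all(4)  # 4 * (1 + 1/2 + 1/3 + 1/4) = 25/3
after_first = total - 1               # 22/3, i.e. 7 1/3
print(total, after_first)
```

Under this reading, a cut base is expected within a handful of positions of any SNP, which fits the observation that SNPs 14 or more bases apart are usually separated by a cut string.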

36
SNP discovery: runtime

By selecting a small k, the search space of
possible variants can be decreased.
Preprocessing: O(n^2)
Finding minimum k: O(nk)
Transforming y → y': O(|Σ|^k (m + k)^k), m = |c|
Runs much faster than the trivial approach.

37
Results

30 amplicons from HS chr 22, taken from 12
individuals, were analyzed
51 SNPs discovered by manual analysis of data
(MS and electrophoresis)
5/51 false negatives, 7/51 false positives
Thresholds aside, found all 51 SNPs.

38
Discussion

Effectively searches compomer variant space for
explanations of SNPs.
Small k (small difference between compomers) is a
reasonable assumption.
Without worrying about the scoring scheme, the
goal of the paper is achieved.

39
Discussion

As a method for actually finding SNPs, there
is a 23% failure rate.
1871 sequence variations per sample generated on
average.
Manual post-processing required to detect subtle
peak intensity changes.
Location information may improve the error rate -->
this may have been the motivation behind sequencing by
compomers.
