Title: SNP and mutation discovery using base-specific cleavage and MALDI-TOF mass spectrometry
1. SNP and mutation discovery using base-specific cleavage and MALDI-TOF mass spectrometry
- Sebastian Böcker, Bioinformatics (2003) 19, i44-i53
- Presented by Mark Chaisson, Bioinformatics
2. Outline
- Goals
- Data acquisition
- Data types (compomers)
- Operations
- Methods
- Results
- Discussion
3. Goal: SNP and mutation discovery
- Biological goal: discover which SNPs are present in a sequence with a known reference strand, in regions that may have clusters of SNPs.
- Computational goal: find the (minimal) changes necessary to transform a reference spectrum into a measured one.
4. Data acquisition
- Target (sample) DNA strand of 100-1000 nt, amplified by PCR.
- Homozygous with respect to the sample (all SNPs belong to the same strand).
- Base-specific cleavage through RNase A, and selective transcription of the forward or reverse strand of the sample DNA.
- This method has been submitted for publication.
5. SEQUENOM MassARRAY
- SNP discovery
- SNP genotyping
- Allele frequency
- Gene expression
- ...
6. Data acquisition
7. Data acquisition: annotated mass spectrum from T1 digestion
8. Data acquisition: peak selection
- An in-silico prediction of the template spectrum (no SNP) is compared with the measured spectrum.
- Differences between the sample spectrum and the predicted spectrum are due to mutations.
- Mass gives information about nucleotide counts, but not their order.
9. Data types: compomers
- s: a sequence (e.g. TGGTCACT)
- c: a compomer, the nucleic acid composition of a sequence
- $c = \mathrm{comp}(s) = (A_i, C_j, T_k, G_l)$
- comp(TGGTCACT) = G2C2A1T3
- $A_i$: A occurs i times; $|c|$: net size of c
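As a minimal sketch of the compomer definition above, a compomer can be modelled in Python as a multiset of nucleotides. The names `comp` and `size` are illustrative choices, not taken from the paper, and `size` assumes a natural compomer (no negative counts).

```python
from collections import Counter

def comp(s: str) -> Counter:
    """Compomer of s: nucleotide counts, with the order discarded."""
    return Counter(s)

def size(c: Counter) -> int:
    """Total number of nucleotides in a natural compomer (all counts >= 0)."""
    return sum(c.values())

c = comp("TGGTCACT")
print(c["G"], c["C"], c["A"], c["T"])  # counts of G, C, A, T
print(size(c))                         # length of the original sequence
```

Two sequences that are permutations of each other map to the same compomer, which is why a mass peak reveals counts but not order.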
10. Operations: string and compomer spectra
- Given a sample string s and a cut string x:
- String spectrum S0(s): the set of substrings of s that are bounded by x or by the ends of s, and do not contain x.
- Compomer spectrum C0(s): the compomers of the strings in S0(s).
- Example: s = ACATGTGCCATTA, x = T: S0 = {ACA, G, GCCA, ε, A}, C0 = {A2C1, G1, G1C2A1, A1}
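The two spectra can be sketched in a few lines of Python. This is an illustration under assumptions: the function names are hypothetical, compomers are stored as sorted (nucleotide, count) tuples so they can live in a set, and cutting is done with `str.split`, which matches the definition when x is a single nucleotide.

```python
from collections import Counter

def string_spectrum(s: str, x: str) -> set:
    """Fragments of s obtained by cutting at every occurrence of x."""
    return set(s.split(x))

def compomer_key(y: str) -> tuple:
    """Hashable compomer: sorted (nucleotide, count) pairs."""
    return tuple(sorted(Counter(y).items()))

def compomer_spectrum(s: str, x: str) -> set:
    """Compomers of all fragments in the string spectrum."""
    return {compomer_key(y) for y in string_spectrum(s, x)}

fragments = string_spectrum("ACATGTGCCATTA", "T")
print(sorted(fragments))
print(compomer_spectrum("ACATGTGCCATTA", "T"))
```

Note that the empty fragment between the two adjacent T's yields an empty compomer here, whereas the slide lists only the four non-empty compomers.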
11. Operations: order of a string
- $\mathrm{ord}_x(s)$: the number of times x appears as a substring of s.
- Most of the time x is a single nucleotide.
- Base-specific endonuclease.
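12. Operations: count operators
- Define $c_{\geq 0}(\sigma) := \max\{c(\sigma), 0\}$
- A count of 3 stays 3; a count of -1 becomes 0.
- Define $c_{\leq 0}(\sigma) := \min\{c(\sigma), 0\}$
- A count of 3 becomes 0; a count of -1 stays -1.
- Negative counts arise in differences $(c - c')$.
- Natural compomer: $c = c_{\geq 0}$
- comp(s) is always natural.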
13. Operations: distance
- Comparing strings (fragments): $d_L(s, s')$, the Levenshtein (edit) distance.
- Comparing compomers: $d(c, c') := \max\{|(c - c')_{\geq 0}|, |(c - c')_{\leq 0}|\}$
- Note: fragments $y, y'$ with compomers $c, c'$ have $d_L(y, y') \geq d(c, c')$.
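The relation between the two distances can be checked on small cases. The sketch below is illustrative only: `compomer_distance` follows the max-of-positive-and-negative-parts definition given above, and `levenshtein` is the textbook dynamic program, not code from the paper.

```python
from collections import Counter

def compomer_distance(c1: Counter, c2: Counter) -> int:
    """d(c, c'): the larger of the surplus and the deficit in c - c'."""
    delta = [c1.get(b, 0) - c2.get(b, 0) for b in set(c1) | set(c2)]
    surplus = sum(v for v in delta if v > 0)   # size of the >= 0 part
    deficit = -sum(v for v in delta if v < 0)  # size of the <= 0 part
    return max(surplus, deficit)

def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion, substitution."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

for y, y2 in [("ACC", "ACG"), ("ACC", "ACCGA"), ("AC", "CA")]:
    d = compomer_distance(Counter(y), Counter(y2))
    print(y, y2, "d =", d, "dL =", levenshtein(y, y2))
```

The pair AC / CA shows that the inequality can be strict: the compomers are identical, yet the strings differ.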
14. Distance between compomers can characterize edit operations: $\delta = c - c'$
15. Operations: SNP fragment explanation
- Given a fragment $y \in \Sigma^*$ and a compomer $c'$ from a different (unknown) fragment $y'$, let $c = \mathrm{comp}(y)$.
- $E(y, c, c')$: a function that generates all fragments $y'$ with $\mathrm{comp}(y') = c'$ and $d_L(y, y') = d(c, c')$.
- Uses information from $\delta_{\geq 0}$ and $\delta_{\leq 0}$, where $\delta$ is the compomer difference, to determine which operations to perform on y.
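A brute-force stand-in for the explanation function can make the definition concrete. This sketch is not the paper's construction: instead of deriving edit operations from the compomer difference, it simply enumerates every ordering of the target compomer and keeps those whose edit distance to y equals the compomer distance. All function names are hypothetical, and the approach is only practical for short fragments.

```python
from collections import Counter
from itertools import permutations

def compomer_distance(c1, c2):
    delta = [c1.get(b, 0) - c2.get(b, 0) for b in set(c1) | set(c2)]
    return max(sum(v for v in delta if v > 0), -sum(v for v in delta if v < 0))

def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def explain_fragment(y, target):
    """All strings with compomer `target` at edit distance d(comp(y), target) from y."""
    dist = compomer_distance(Counter(y), target)
    letters = "".join(base * n for base, n in target.items())
    candidates = {"".join(p) for p in permutations(letters)}
    return sorted(v for v in candidates if levenshtein(y, v) == dist)

print(explain_fragment("ACC", Counter("ACG")))
```

For y = ACC and the target compomer A1C1G1, only the two single-substitution strings survive the filter.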
16. Operations: bound determination
- Given a string s, a cut string x, and two indices i, j with $1 \leq i \leq j \leq |s|$:
- $b_{s,x}(i, j) := \{L : s \text{ is not left bounded at } i\} \cup \{R : s \text{ is not right bounded at } j\}$
- s is left (right) bounded by x if the nucleotide to the left (right) is x, or the end of the string.
- Example: s = GATACC, x = C: $b_{s,x}(2, 3) = \{L, R\}$, $b_{s,x}(2, 4) = \{L\}$, $b_{s,x}(1, 4) = \emptyset$
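The boundary set can be computed directly from the definition. The sketch below assumes 1-based inclusive indices as on the slide and a single-nucleotide cut string; the function name `bounds` is an illustrative choice.

```python
def bounds(s: str, x: str, i: int, j: int) -> set:
    """Return the subset of {"L", "R"} on which s[i..j] is NOT bounded by x."""
    b = set()
    left_bounded = i == 1 or s[i - 2] == x      # string start, or x just before i
    right_bounded = j == len(s) or s[j] == x    # string end, or x just after j
    if not left_bounded:
        b.add("L")
    if not right_bounded:
        b.add("R")
    return b

s, x = "GATACC", "C"
print(bounds(s, x, 2, 3))
print(bounds(s, x, 2, 4))
print(bounds(s, x, 1, 4))
```

The three calls reproduce the slide's example: unbounded on both sides, unbounded on the left only, and fully bounded.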
17. Operations: set of bounded strings, compomers
- Given a string s and a cut string x, define
- $S_B(s, x) := \{(s_{i,j}, b_{s,x}(i, j)) : 1 \leq i \leq j \leq |s|\}$
- $C_B(s, x) := \{(\mathrm{comp}(y), b) : (y, b) \in S_B(s, x)\}$
18. Operations: set of bounded strings
- s = ATTCA
19. Operations: set of bounded compomers
- s = ATTCA
20. Operations: set of bounded compomers
- s = ATTCA
21. Operations: distance with boundaries
- $D(c, b, c') := d(c, c') + |b|$
- Includes a boundary term in the distance operator.
22. Methods: SNP discovery from mass spectrometry problem
- Given a reference string s and a cut string x: for a compomer $c'$, find all $s'$ satisfying $c' \in C_0(s')$ such that $d_L(s, s')$ is minimal.
23. Methods: SNP discovery, trivial approach
- Simulate the mass spectra for all potential sequence variations of the reference sequence, and compare them to the measured spectra.
- Runtime is feasible for a single base substitution, insertion, or deletion.
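The trivial approach can be sketched for the single-edit case. This is an illustration under simplifying assumptions: compomers stand in for mass peaks, only one cut base is simulated, and the function names are hypothetical.

```python
from collections import Counter

BASES = "ACGT"

def single_edit_variants(s: str) -> set:
    """All strings one substitution, insertion, or deletion away from s."""
    out = set()
    for i in range(len(s) + 1):
        for b in BASES:
            out.add(s[:i] + b + s[i:])              # insertion before position i
        if i < len(s):
            out.add(s[:i] + s[i + 1:])              # deletion of position i
            for b in BASES:
                out.add(s[:i] + b + s[i + 1:])      # substitution at position i
    out.discard(s)                                  # drop the unchanged reference
    return out

def compomer_spectrum(s: str, x: str) -> set:
    return {tuple(sorted(Counter(y).items())) for y in s.split(x)}

def variants_explaining(reference: str, x: str, fragment_like: str) -> list:
    """Single-edit variants whose spectrum contains the compomer of `fragment_like`."""
    target = tuple(sorted(Counter(fragment_like).items()))
    return sorted(v for v in single_edit_variants(reference)
                  if target in compomer_spectrum(v, x))

print(variants_explaining("TACCTAT", "T", "ACG"))
```

Even in this toy case two variants explain the new compomer equally well, which is why further reactions are needed to discriminate between candidates. The number of variants grows quickly once more than one edit is allowed.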
24. SNP discovery: presented approach
- Input: s, x, a compomer $c'$, and a maximal (compomer) cost k; n = |s|
- Preprocessing: build the set IBC of all indexed bounded compomers $(c, b, i, j)$ for $1 \leq i \leq j \leq n$, subject to $\mathrm{ord}_x(s_{i,j}) + |b| \leq k$
25. SNP discovery: presented approach
- Preprocessing example: s = ATGTGTCC, x = T
- With k = 2, GTGTC is deleted from IBC: $\mathrm{ord}_x = 2$ and $|b| = 1$ (it is not right bounded), so its cost of 3 exceeds k.
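The preprocessing step can be sketched as a double loop over substrings that keeps only entries whose number of cut strings plus number of unbounded sides is at most k. The function names are hypothetical, the cut string is assumed to be a single nucleotide, and indices are 1-based and inclusive.

```python
from collections import Counter

def bounds(s, x, i, j):
    """Sides ("L", "R") on which s[i..j] is not bounded by x or a string end."""
    b = set()
    if not (i == 1 or s[i - 2] == x):
        b.add("L")
    if not (j == len(s) or s[j] == x):
        b.add("R")
    return frozenset(b)

def indexed_bounded_compomers(s, x, k):
    """All (compomer, b, i, j) with ord_x(s[i..j]) + |b| <= k."""
    ibc = []
    for i in range(1, len(s) + 1):
        for j in range(i, len(s) + 1):
            y = s[i - 1:j]
            b = bounds(s, x, i, j)
            if y.count(x) + len(b) <= k:     # order plus boundary cost
                ibc.append((Counter(y), b, i, j))
    return ibc

kept = {(i, j) for _, _, i, j in indexed_bounded_compomers("ATGTGTCC", "T", 2)}
print((3, 7) in kept)   # GTGTC occupies positions 3..7
```

With k = 2 the substring GTGTC is filtered out, and it reappears once k is raised to 3.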
26. SNP discovery: presented approach
- Preprocessing example: s = AGCGTTA
- Table band n = 2: AG, GC, CG, GT, ...
27. SNP discovery: presented approach
- Preprocessing example: s = AGCGTTA
- Table band n = 3: AGC, G2C, ...
28. SNP discovery: presented approach
- Step 1: discover a peak in M' that is not in M (M: predicted reference spectrum, M': measured sample spectrum).
29. SNP discovery: presented approach
- For a new peak p:
- Build a list of compomers $c'_1, c'_2, \ldots$ whose mass explains the peak.
- Different compomers can share a mass: m(CCGGG) ≈ m(AAAAA)
- For each compomer $c'_i$:
- Find a minimum distance k by looking in a band of IBC.
- Example: $c'_1$ = G2T1C1, $|c'_1| = 4$: look in table bands n = 3..5
- Find all compomers c such that $D(c, b, c'_i) = k$; call this set $C_e$.
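The band lookup can be sketched as follows. This is a simplified illustration, not the paper's data structure: it rescans substrings instead of using a precomputed table, restricts the search to substrings whose length is within k of the size of the candidate compomer (an assumption based on the band example above), and uses hypothetical function names.

```python
from collections import Counter

def bounds(s, x, i, j):
    b = set()
    if not (i == 1 or s[i - 2] == x):
        b.add("L")
    if not (j == len(s) or s[j] == x):
        b.add("R")
    return b

def compomer_distance(c1, c2):
    delta = [c1.get(n, 0) - c2.get(n, 0) for n in set(c1) | set(c2)]
    return max(sum(v for v in delta if v > 0), -sum(v for v in delta if v < 0))

def best_matches(s, x, k, target):
    """Minimal D(c, b, c') and the index pairs (i, j) that attain it."""
    best, hits = None, []
    for i in range(1, len(s) + 1):
        for j in range(i, len(s) + 1):
            y, b = s[i - 1:j], bounds(s, x, i, j)
            if y.count(x) + len(b) > k:                  # not in IBC
                continue
            if abs(len(y) - sum(target.values())) > k:   # outside the band
                continue
            cost = compomer_distance(Counter(y), target) + len(b)
            if best is None or cost < best:
                best, hits = cost, [(i, j)]
            elif cost == best:
                hits.append((i, j))
    return best, hits

# Reference TACCTAT, sample TACCGAT: the new compomer is A2C2G1.
print(best_matches("TACCTAT", "T", 2, Counter("ACCGA")))
```

In this example the only substring at minimal distance is ACCTA at positions 2..6, the region in which the cut base was mutated.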
30. SNP discovery: presented approach
- For each compomer $c_e \in C_e$:
- Generate all strings $y'$ such that $d_L(y_e, y') = D(c_e, b, c')$, where $\mathrm{comp}(y_e) = c_e$.
- Generate a new string $s_e$ in which $y_e$ is replaced by $y'$.
- Store the string in a list S.
- This generates a list of potential new strings with SNPs and mutations.
31. SNP discovery: presented approach
- Score all potential new strings according to the mass spectra they would create using all 4 base-specific endonucleases (4 separate reactions).
- Rank strings according to the number of peaks they correctly reproduce.
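A toy version of the scoring step is sketched below. It rests on assumptions that are not in the slides: compomers stand in for mass peaks, the measured spectra are simulated from a known sample rather than read from an instrument, and a candidate's score is simply the number of measured compomers it reproduces, summed over the four cut bases.

```python
from collections import Counter

BASES = "ACGT"

def compomer_spectrum(s, x):
    return {tuple(sorted(Counter(y).items())) for y in s.split(x)}

def all_spectra(s):
    """One compomer spectrum per base-specific reaction."""
    return {x: compomer_spectrum(s, x) for x in BASES}

def score(candidate, measured):
    """Number of measured compomers the candidate reproduces, over all reactions."""
    predicted = all_spectra(candidate)
    return sum(len(predicted[x] & measured[x]) for x in BASES)

def rank(candidates, measured):
    return sorted(candidates, key=lambda s: score(s, measured), reverse=True)

measured = all_spectra("TACGTAT")            # simulated sample spectra
candidates = ["TACCTAT", "TAGCTAT", "TACGTAT"]
for s in rank(candidates, measured):
    print(s, score(s, measured))
```

The T and A reactions alone cannot separate TACGTAT from TAGCTAT here; the C and G reactions break the tie, which illustrates why all four reactions are scored.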
32. SNP discovery: motivation
- If 1 SNP occurs in compomer c from the middle of a fragment y, with c = comp(y), then $d(c, c') = 1$: TACCTAT {A1C2, A1} ⇒ TACGTAT {A1C1G1, A1}
- It is common that a SNP will mutate the cut string, resulting in compomers with $d(c, c') \gg k$: TACCTAT {A1C2, A1} ⇒ TACCGAT {A2C2G1}
33. SNP discovery: preprocessing motivation
- Preprocessing step: build the set IBC of all indexed bounded compomers $(c, b, i, j)$ for $1 \leq i \leq j \leq n$, subject to $\mathrm{ord}_x(s_{i,j}) + |b| \leq k$
- s = TACCTAT: IBC = {(T, {R}, 1, 1), (TA, {R}, 1, 2), (TAC, {R}, 1, 3), (TACC, ∅, 1, 4), ...}
- In case s is mutated at the cut string, there will be a bounded compomer to compare it to.
- s = TACCTAT, s' = TACCGAT: y = (ACCTA, ∅, 2, 6), c = A2C2T1, b = ∅
- s = TACCTAT, s' = TATCTAT: y = (A, {R}, 2, 2), c = A1, b = {R}
34. SNP discovery: processing motivation
- The processing step is based on the theorem:
- Given sample strings $s, s' \in \Sigma^*$ with $d_L(s, s') \leq k$, a cut string $x \in \Sigma^*$, and a compomer $c' \in C_0(s', x)$, there exists a bounded compomer $(c, b) \in C_B^k(s, x)$ such that $D(c, b, c') \leq d_L(s, s')$.
- One of the bounded compomers found in the preprocessing step will be at minimal distance from $c'$.
35. SNP discovery: minimum k
- Based on the notion of independent SNPs.
- Independent SNPs: SNPs (mutations) that are separated by a cut string x.
- If a DNA strand has i.i.d. nucleotides, on average every 7 1/3 nucleotides contain all 4 bases.
- In case studies: on average one SNP every 231 base pairs, minimal distance 14.
- k mutations: k = 1 or k = 2 is sufficient.
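One way to arrive at the 7 1/3 figure, offered here as an assumption about its derivation rather than something the slides state, is a coupon-collector argument with four equally likely nucleotides: after the first nucleotide has been read, the expected number of further nucleotides until the remaining three have all appeared is 4/3 + 4/2 + 4/1.

```python
from fractions import Fraction

def expected_further_draws(alphabet_size: int = 4) -> Fraction:
    """Expected extra draws, after the first, until every symbol has been seen.

    With `seen` distinct symbols so far, the wait for a new one is geometric
    with mean alphabet_size / (alphabet_size - seen).
    """
    return sum(
        (Fraction(alphabet_size, alphabet_size - seen) for seen in range(1, alphabet_size)),
        Fraction(0),
    )

extra = expected_further_draws(4)
print(extra)        # exact value as a fraction
print(float(extra))
```

Counting the first nucleotide as well adds one to this expectation.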
36. SNP discovery: runtime
- By selecting a small k, the search space of possible variants can be decreased.
- Preprocessing: $O(n^2)$
- Finding minimum k: $O(nk)$
- Transforming $y \to y'$: $O(|\Sigma|^k (m + k)^k)$, with $m = |c|$
- Runs much faster than the trivial approach.
37. Results
- 30 amplicons from human (HS) chromosome 22, taken from 12 individuals, were analyzed.
- 51 SNPs discovered by manual analysis of data (MS and electrophoresis).
- 5/51 false negatives, 7/51 false positives.
- Thresholds aside, found all 51 SNPs.
38. Discussion
- Effectively searches the compomer variant space for explanations of SNPs.
- Small k (small difference between compomers) is a reasonable assumption.
- Without worrying about the scoring scheme, the goal of the paper is achieved.
39. Discussion
- As a method for actually finding SNPs, there is a 23% failure rate.
- 1871 sequence variations per sample generated on average.
- Manual post-processing required to detect subtle peak intensity changes.
- Location information may improve the error rate --> may have been the motivation behind Sequencing by compomers.
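The failure rate quoted above appears consistent with adding the false negatives and false positives reported on the Results slide. This reading is an assumption; the slides do not spell out how the figure was obtained.

```latex
\[
\frac{5 + 7}{51} = \frac{12}{51} \approx 0.235
\]
```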