SNP and mutation discovery using base-specific cleavage and MALDI-TOF mass spectrometry

1
SNP and mutation discovery using base-specific
cleavage and MALDI-TOF mass spectrometry
  • Sebastian Böcker, Bioinformatics (2003) 19,
    i44-i53
  • Presented by Mark Chaisson, Bioinformatics

2
Outline
  • Goals
  • Data acquisition
  • Data types (compomers) and operations
  • Methods
  • Results
  • Discussion

3
Goal: SNP and mutation discovery
  • Biological goal: discover which SNPs are present
    in a sequence with a known reference strand, in
    regions that may have clusters of SNPs.
  • Computational goal: find the (minimal) changes
    necessary to transform a reference spectrum
    into a measured one.

4
Data acquisition
  • Target (sample) DNA strand: 100-1000 nt,
    amplified by PCR.
  • Homozygous with respect to sample (all SNPs
    belong to the same strand).
  • Base-specific cleavage through RNase A, and
    selective transcription of the forward or reverse
    strand of the sample DNA.
  • This method has been submitted for publication.

5
SEQUENOM MassARRAY
  • SNP discovery
  • SNP genotyping
  • Allele frequency
  • Gene expression
  • ...

6
Data acquisition
  • Total digestion

7
Data acquisition: annotated mass spectrum from T1
digestion

8
Data acquisition: peak selection
  • An in-silico prediction of the template spectrum
    (no SNP) is compared with the measured spectrum.
  • Differences between the sample and predicted
    spectra are attributed to mutations.
  • Mass can give information about nucleotide
    counts, but not order.

9
Data types: compomers
  • $s$: a sequence (e.g. TGGTCACT)
  • $c$: a compomer, the nucleic acid composition of a
    sequence
  • $c = \mathrm{comp}(s) = (A_i, C_j, T_k, G_l)$
  • comp(TGGTCACT) = $G_2 C_2 A_1 T_3$
  • $|A_i| = i$, and $|c| = i + j + k + l$ is the net size of $c$
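
The compomer definition above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's code: the names `comp` and `fmt` are made up here, and the G/C/A/T print order simply mirrors the slide's example.

```python
from collections import Counter

def comp(s: str) -> Counter:
    """Return the compomer of s: nucleotide counts, with the order discarded."""
    return Counter(s)

def fmt(c: Counter) -> str:
    """Render a compomer in the slide's G/C/A/T order, e.g. G2C2A1T3."""
    return "".join(f"{b}{c[b]}" for b in "GCAT" if c[b] > 0)

c = comp("TGGTCACT")
print(fmt(c))            # G2C2A1T3
print(sum(c.values()))   # 8, the net size |c|
```

Two sequences that are permutations of each other map to the same compomer, which is why mass alone cannot recover nucleotide order.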

10
Operations: string and compomer spectra
  • Given a sample string $s$ and a cut string $x$:
  • String spectrum $S_0(s)$: the set of substrings of $s$
    that are bounded by $x$ or by the ends of $s$, and
    do not contain $x$.
  • Compomer spectrum $C_0(s)$: the compomers of $S_0(s)$.
  • Example: $s$ = ACATGTGCCATTA, $x$ = T gives
    $S_0 = \{\text{ACA}, \text{G}, \text{GCCA}, \varepsilon, \text{A}\}$ and
    $C_0 = \{A_2C_1, G_1, G_1C_2A_1, A_1\}$.
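
A small sketch of both spectra, assuming a single-nucleotide cut string so that `str.split` models total digestion. The function names are hypothetical, and dropping the empty fragment from the compomer spectrum is a choice made here to match the slide's list.

```python
from collections import Counter

def string_spectrum(s: str, x: str) -> list[str]:
    """Fragments of s left after cutting out every occurrence of x (may include '')."""
    return s.split(x)

def compomer_spectrum(s: str, x: str) -> set:
    """Distinct compomers of the non-empty fragments, as frozensets of (base, count)."""
    return {frozenset(Counter(y).items()) for y in string_spectrum(s, x) if y}

frags = string_spectrum("ACATGTGCCATTA", "T")
print(frags)                                        # ['ACA', 'G', 'GCCA', '', 'A']
print(len(compomer_spectrum("ACATGTGCCATTA", "T")))  # 4
```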

11
Operations: order of a string
  • $\mathrm{ord}_x(s)$: the number of times $x$ appears as a
    substring of $s$.
  • Most of the time $x$ is a single nucleotide,
    cut by a base-specific endonuclease.

12
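Operations: count operators
  • Define $c_{\geq 0}(\sigma) = \max\{c(\sigma), 0\}$
  • e.g. a count of $3$ maps to $3$; a count of $-1$ maps to $0$
  • Define $c_{\leq 0}(\sigma) = \min\{c(\sigma), 0\}$
  • e.g. a count of $3$ maps to $0$; a count of $-1$ maps to $-1$
  • Negative counts arise in differences $(c - c')$
  • Natural compomer: $c = c_{\geq 0}$
  • comp(s) is always natural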

13
Operations: distance
  • When comparing strings (fragments): $d_L(s, s')$ is the
    Levenshtein (edit) distance
  • When comparing compomers:
    $d(c, c') = \max\{|(c - c')_{\geq 0}|, |(c - c')_{\leq 0}|\}$
  • Note: fragments $y, y'$ with compomers $c, c'$ have
    $d_L(y, y') \geq d(c, c')$
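
The compomer distance can be sketched as follows. It assumes the reading given above, where the distance is the larger of the total positive and total negative parts of the count difference; the helper names are illustrative only.

```python
from collections import Counter

BASES = "ACGT"

def comp(s: str) -> Counter:
    return Counter(s)

def d(c: Counter, c2: Counter) -> int:
    """Compomer distance: max of the sizes of the positive and negative parts of c - c2."""
    diff = [c[b] - c2[b] for b in BASES]
    pos = sum(v for v in diff if v > 0)    # size of the positive part
    neg = -sum(v for v in diff if v < 0)   # size of the negative part
    return max(pos, neg)

print(d(comp("ACC"), comp("ACG")))    # 1: one substitution (C -> G)
print(d(comp("ACC"), comp("ACCGA")))  # 2: two inserted bases
```

A substitution removes one base and adds another, so it contributes 1 to both parts but only 1 to the maximum, which keeps the compomer distance a lower bound on the edit distance.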

14
Distance between compomers can characterize edit
operations: $\Delta = c - c'$

15
Operations: SNP fragment explanation
  • Given a fragment $y \in \Sigma^*$ and a compomer $c'$ from
    a different (unknown) fragment $y'$, let $c = \mathrm{comp}(y)$.
  • $E(y, c, c')$: a function that generates all fragments $y'$
    with $\mathrm{comp}(y') = c'$ and $d_L(y, y') = d(c, c')$.
  • Uses information from $\Delta_{\geq 0}$ and $\Delta_{\leq 0}$,
    where $\Delta = c - c'$, to determine which operations to
    perform on $y$.

16
Operations: bound determination
  • Given a string $s$, a cut string $x$, and two indices
    $i, j$ with $1 \leq i \leq j \leq |s|$:
  • $b_{s,x}(i, j) = \{L : s_{i,j} \text{ is not left bounded at } i\}
    \cup \{R : s_{i,j} \text{ is not right bounded at } j\}$
  • The substring $s_{i,j}$ is left (right) bounded by $x$ if
    what lies to its left (right) is $x$, or the end of the
    string.
  • Example: $s$ = GATACC, $x$ = C gives $b_{s,x}(2, 3) = \{L, R\}$,
    $b_{s,x}(2, 4) = \{L\}$, $b_{s,x}(1, 4) = \emptyset$
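
The boundary function can be checked against the slide's example with a short sketch. It assumes a single-nucleotide cut string and 1-based inclusive indices as on the slide; `bounds` is a name chosen here.

```python
def bounds(s: str, x: str, i: int, j: int) -> frozenset:
    """b_{s,x}(i, j) for 1-based inclusive i..j and a single-nucleotide cut string x."""
    b = set()
    # Left bounded: the substring starts at the beginning of s, or is preceded by x.
    if i > 1 and s[i - 2] != x:
        b.add("L")
    # Right bounded: the substring ends at the end of s, or is followed by x.
    if j < len(s) and s[j] != x:
        b.add("R")
    return frozenset(b)

print(sorted(bounds("GATACC", "C", 2, 3)))  # ['L', 'R']
print(sorted(bounds("GATACC", "C", 2, 4)))  # ['L']
print(sorted(bounds("GATACC", "C", 1, 4)))  # []
```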

17
Operations: set of bounded strings, compomers
  • Given a string $s$ and a cut string $x$, define
    $S_B(s, x) = \{(s_{i,j}, b_{s,x}(i, j)) : 1 \leq i \leq j \leq |s|\}$
    $C_B(s, x) = \{(\mathrm{comp}(y), b) : (y, b) \in S_B(s, x)\}$

18
Operations: set of bounded strings
s = ATTCA

19
Operations: set of bounded compomers
s = ATTCA

20
Operations: set of bounded compomers
s = ATTCA

21
Operations: distance with boundaries
  • $D(c, b, c') = d(c, c') + |b|$
  • Includes a boundary term in the distance
    operator.

22
Methods: the SNP discovery from mass spectrometry
problem
  • Given a reference string $s$, a cut string $x$, and
    a compomer $c'$: find all $s'$ satisfying
    $c' \in C_0(s')$ such that $d_L(s, s')$ is
    minimal.

23
Methods: SNP discovery, trivial approach
  • Simulate the mass spectra for all potential
    sequence variations of the reference sequence,
    and compare them to the measured spectra.
  • Runtime is feasible for a single-base substitution,
    insertion, or deletion.
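
To make the size of the trivial search space concrete, here is a sketch that enumerates every single-edit variant of a reference string. The function name and the DNA alphabet default are assumptions for illustration; simulating and comparing spectra for each variant is left out.

```python
def single_edit_variants(s: str, alphabet: str = "ACGT") -> set[str]:
    """All strings at edit distance exactly 1 from s."""
    variants = set()
    for i in range(len(s) + 1):
        for a in alphabet:
            variants.add(s[:i] + a + s[i:])              # insertion before position i
            if i < len(s) and a != s[i]:
                variants.add(s[:i] + a + s[i + 1:])      # substitution at position i
        if i < len(s):
            variants.add(s[:i] + s[i + 1:])              # deletion of position i
    return variants

print(len(single_edit_variants("AC")))                   # 18
print("TACGTAT" in single_edit_variants("TACCTAT"))      # True
```

The count grows linearly with the sequence length for one edit, but applying the function repeatedly for several edits multiplies the number of candidates at each step, which is what makes the trivial approach impractical beyond single changes.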

24
SNP discovery: presented approach
  • Input: $s$, $x$, a compomer $c'$, and a maximal
    (compomer) cost $k$; $n = |s|$
  • Preprocessing:
  • build the set IBC of all indexed bounded
    compomers $(c, b, i, j)$ for $1 \leq i \leq j \leq n$,
    subject to $\mathrm{ord}_x(s_{i,j}) + |b| \leq k$
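
The preprocessing step can be sketched as a direct double loop over all substrings. This assumes the constraint reads "order plus number of unbounded sides is at most k" and a single-nucleotide cut string; all function names are hypothetical, and no attempt is made to match the paper's actual data structure.

```python
from collections import Counter

def ord_x(y: str, x: str) -> int:
    """Number of (possibly overlapping) occurrences of x in y."""
    return sum(y.startswith(x, p) for p in range(len(y)))

def bounds(s: str, x: str, i: int, j: int) -> frozenset:
    """Unbounded sides of s[i..j] (1-based, inclusive) for a single-nucleotide x."""
    b = set()
    if i > 1 and s[i - 2] != x:
        b.add("L")
    if j < len(s) and s[j] != x:
        b.add("R")
    return frozenset(b)

def indexed_bounded_compomers(s: str, x: str, k: int) -> list:
    """All (compomer, bounds, i, j) with ord_x(s[i..j]) + |bounds| <= k."""
    ibc = []
    for i in range(1, len(s) + 1):
        for j in range(i, len(s) + 1):
            y, b = s[i - 1:j], bounds(s, x, i, j)
            if ord_x(y, x) + len(b) <= k:
                ibc.append((Counter(y), b, i, j))
    return ibc

kept = {(i, j) for _, _, i, j in indexed_bounded_compomers("ATGTGTCC", "T", 2)}
print((3, 7) in kept)  # False: GTGTC has order 2 plus one unbounded side
```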

25
SNP discovery: presented approach
  • Preprocessing example: $s$ = ATGTGTCC, $x$ = T
  • GTGTC is deleted from IBC with $k = 2$, since
    $\mathrm{ord}_x(\text{GTGTC}) + |b| = 2 + 1 = 3 > k$

26
SNP discovery: presented approach
  • Preprocessing example: $s$ = AGCGTTA
  • Band $n = 2$ (substrings of length 2): AG, GC, CG, GT, ...

27
SNP discovery: presented approach
  • Preprocessing example: $s$ = AGCGTTA
  • Band $n = 3$ (compomers of substrings of length 3):
    AGC, $G_2C$, ...

28
SNP discovery: presented approach
  • 1. Discover a peak in the measured spectrum $M'$ that is
    not in the predicted spectrum $M$

29
SNP discovery: presented approach
  • For a new peak $p$:
  • Build a list of compomers $c_1, c_2, \ldots$ whose mass
    explains the peak.
  • Several compomers can share a mass, e.g.
    $m(\text{CCGGG}) \approx m(\text{AAAAA})$
  • For each compomer $c_i$:
  • find the minimum distance $k$ by looking in the
    matching bands of IBC
  • e.g. $c_1 = G_2TC$, $|c_1| = 4$: look in table bands $n = 3..5$
  • find all bounded compomers $(c, b)$ such that
    $D(c, b, c_i) = k$; call this set $C_e$
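
The first step, listing compomers whose mass explains a peak, can be sketched by brute-force enumeration. The masses below are approximate average masses of DNA nucleotide residues and are an assumption of this sketch: the real measurement involves cleaved RNA transcripts with end groups, which are ignored here. The tolerance, size cap, and function names are likewise illustrative.

```python
from itertools import product

# Assumed, approximate residue masses in Da; for illustration only.
MASS = {"A": 313.21, "C": 289.18, "G": 329.21, "T": 304.20}

def mass(c: dict) -> float:
    """Mass of a compomer given as {base: count}."""
    return sum(MASS[b] * n for b, n in c.items())

def compomers_for_peak(peak: float, tol: float = 0.1, max_size: int = 6) -> list:
    """All compomers of size 1..max_size whose mass is within tol of the peak."""
    hits = []
    for counts in product(range(max_size + 1), repeat=4):
        if 0 < sum(counts) <= max_size:
            c = dict(zip("ACGT", counts))
            if abs(mass(c) - peak) <= tol:
                hits.append(c)
    return hits

for c in compomers_for_peak(mass({"A": 5})):
    print(c)
```

With these masses, both A5 and C2G3 fall within the tolerance of the same peak, which is the ambiguity the slide's CCGGG/AAAAA example points at.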

30
SNP discovery: presented approach
  • For each compomer $c_e \in C_e$:
  • generate all strings $y'$ such that
    $d_L(y_e, y') = D(c_e, b, c')$, where $\mathrm{comp}(y_e) = c_e$.
  • generate a new string $s_e$ in which $y_e$ is replaced
    by $y'$.
  • store the string in a list $S$.
  • This generates a list of potential new
    strings with SNPs and mutations.
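
A brute-force stand-in for this generation step is sketched below. It does not use the difference of compomers to steer the edits as the slides describe; it simply explores every string within k edits of the fragment and keeps those with the target compomer. The names `edits1`, `explain`, and `splice` are hypothetical.

```python
from collections import Counter

def edits1(y: str, alphabet: str = "ACGT") -> set[str]:
    """All strings one insertion, substitution, or deletion away from y."""
    out = set()
    for i in range(len(y) + 1):
        for a in alphabet:
            out.add(y[:i] + a + y[i:])
            if i < len(y) and a != y[i]:
                out.add(y[:i] + a + y[i + 1:])
        if i < len(y):
            out.add(y[:i] + y[i + 1:])
    return out

def explain(y: str, c_target: Counter, k: int) -> list[str]:
    """All strings within k edits of y whose compomer equals c_target."""
    frontier, seen = {y}, {y}
    for _ in range(k):
        frontier = {v for u in frontier for v in edits1(u)} - seen
        seen |= frontier
    return sorted(v for v in seen if Counter(v) == c_target)

def splice(s: str, i: int, j: int, y_new: str) -> str:
    """Replace s[i..j] (1-based, inclusive) with y_new."""
    return s[:i - 1] + y_new + s[j:]

candidates = explain("ACC", Counter("ACG"), 1)
print(candidates)                                         # ['ACG', 'AGC']
print([splice("TACCTAT", 2, 4, y) for y in candidates])   # ['TACGTAT', 'TAGCTAT']
```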

31
SNP discovery: presented approach
  • Score all potential new strings according to the
    mass spectra they would create using all 4
    base-specific endonucleases (4 separate reactions).
  • Rank strings according to the number of peaks
    they correctly reproduce.
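
The ranking idea can be sketched with compomers standing in for peaks, which is a simplification made here: the actual method compares masses and peak intensities. Each candidate is digested in silico with all four cut bases and scored by how many observed compomers it reproduces. All names are illustrative.

```python
from collections import Counter

def compomer_spectrum(s: str, x: str) -> set:
    """Distinct compomers of the non-empty fragments of s after cutting at x."""
    return {frozenset(Counter(y).items()) for y in s.split(x) if y}

def score(candidate: str, observed: dict) -> int:
    """Number of observed compomers reproduced, summed over the four reactions."""
    return sum(len(compomer_spectrum(candidate, x) & observed[x]) for x in "ACGT")

def rank(candidates: list[str], observed: dict) -> list[str]:
    """Candidates ordered from best to worst score."""
    return sorted(candidates, key=lambda s: score(s, observed), reverse=True)

# Pretend the true sample is TACGTAT and "observe" its four compomer spectra.
observed = {x: compomer_spectrum("TACGTAT", x) for x in "ACGT"}
print(rank(["TACCTAT", "TAGCTAT", "TACGTAT"], observed))
# ['TACGTAT', 'TAGCTAT', 'TACCTAT']
```

In this toy case TAGCTAT matches the sample under the T and A reactions but not under C and G, which illustrates why four separate reactions help separate candidates with identical compomers in one reaction.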

32
SNP discovery: motivation
  • If one SNP occurs in the middle of a fragment $y$
    with $c = \mathrm{comp}(y)$, then $d(c, c') = 1$.
    Example ($x$ = T): TACCTAT $\{A_1C_2, A_1\}$ → TACGTAT $\{A_1C_1G_1, A_1\}$
  • It is common that a SNP will mutate the
    cut string, resulting in compomers with $d(c, c') \gg k$.
    Example ($x$ = T): TACCTAT $\{A_1C_2, A_1\}$ → TACCGAT $\{A_2C_2G_1\}$

33
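SNP discovery: preprocessing motivation
  • Preprocessing step: build the set IBC of all indexed
    bounded compomers $(c, b, i, j)$ for $1 \leq i \leq j \leq n$,
    subject to $\mathrm{ord}_x(s_{i,j}) + |b| \leq k$
  • $s$ = TACCTAT, $x$ = T: IBC contains $(\text{T}, \{R\}, 1, 1)$,
    $(\text{TA}, \{R\}, 1, 2)$, $(\text{TAC}, \{R\}, 1, 3)$, $(\text{TACC}, \emptyset, 1, 4)$, ...
  • In case $s$ is mutated at the cut string, there will be
    a bounded compomer to compare it to.
  • $s$ = TACCTAT, $s'$ = TACCGAT: $y = (\text{ACCTA}, \emptyset, 2, 6)$,
    $c = A_2C_2T_1$, $b = \emptyset$
  • $s$ = TACCTAT, $s'$ = TATCTAT: $y = (\text{A}, \{R\}, 2, 2)$,
    $c = A_1$, $b = \{R\}$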

34
SNP discovery: processing motivation
  • The processing step is based on the theorem:
  • Given sample strings $s, s' \in \Sigma^*$ with $d_L(s, s') \leq k$,
    a cut string $x \in \Sigma^*$, and a compomer $c' \in C_0(s', x)$,
    there exists a bounded compomer
    $(c, b) \in C_B^k(s, x)$ such that $D(c, b, c') \leq d_L(s, s')$.
  • One of the bounded compomers found in the
    preprocessing step will be at minimal distance from
    $c'$.

35
SNP discovery: minimum k
  • Based on the notion of independent SNPs.
  • Independent SNPs: SNPs (mutations) that are
    separated by a cut string $x$.
  • If a DNA strand has i.i.d. nucleotides, on
    average every 7 1/3 nucleotides has all 4
    present.
  • In case studies, there was on average one SNP every
    231 base pairs, with a minimal distance of 14.
  • Number of mutations $k$: $k = 1$ or $k = 2$ is sufficient.
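
One reading that reproduces the 7 1/3 figure is the coupon-collector expectation for four equally likely bases, not counting the starting base. Whether the slide derived the number this way is an assumption; the arithmetic itself is exact.

```python
from fractions import Fraction

def expected_draws_to_see_all(m: int) -> Fraction:
    """Expected number of i.i.d. uniform draws until all m symbols have appeared."""
    return sum(Fraction(m, i) for i in range(1, m + 1))

e = expected_draws_to_see_all(4)
print(e)      # 25/3, i.e. 8 1/3 draws including the first base
print(e - 1)  # 22/3, i.e. 7 1/3 further bases after the first
```

Compared with the observed minimum SNP spacing of 14 in the case studies, this suggests that two SNPs are usually separated by a cut base in at least one of the four reactions.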

36
SNP discovery: runtime
  • By selecting a small $k$, the search space of
    possible variants can be decreased.
  • Preprocessing: $O(n^2)$
  • Finding minimum $k$: $O(nk)$
  • Transforming $y \to y'$: $O(|\Sigma|^k (m + k)^k)$, $m = |c|$
  • Runs much faster than the trivial approach.

37
Results
  • 30 amplicons from human (HS) chromosome 22, taken
    from 12 individuals, were analyzed.
  • 51 SNPs were discovered by manual analysis of the data
    (MS and electrophoresis).
  • 5/51 false negatives, 7/51 false positives.
  • Thresholds aside, the method found all 51 SNPs.

38
Discussion
  • Effectively searches compomer variant space for
    explanations of SNPs.
  • Small $k$ (small difference between compomers) is a
    reasonable assumption.
  • Without worrying about the scoring scheme, the
    goal of the paper is achieved.

39
Discussion
  • As a method for actually finding SNPs, there
    is a 23% failure rate.
  • 1871 sequence variations per sample are generated on
    average.
  • Manual post-processing is required to detect subtle
    peak intensity changes.
  • Location information may improve the error rate;
    this may have been the motivation behind sequencing by
    compomers.