Model-based species identification using DNA barcodes

Slides: 31
Provided by: Bog648

Transcript and Presenter's Notes

Title: Model-based species identification using DNA barcodes


1
Model-based species identification using DNA
barcodes
Bogdan Pasaniuc
CSE Department, University of Connecticut
Joint work with Ion Mandoiu and Sotirios Kentros
2
Outline
  • Existing approaches to species identification
  • Proposed statistical model-based methods
  • Experimental Results
  • Ongoing Work and Conclusions

3
Background on DNA barcoding
  • Recently proposed tool for species identification
  • Use short DNA region as fingerprint for the
    species
  • Region of choice: cytochrome c oxidase subunit 1
    mitochondrial gene ("COI", 648 base pairs long)
  • Key assumption: inter-species variability is higher
    than intra-species variability

4
Species identification problem
  • Given:
    • Database DB containing barcodes from known
      species
    • New barcode x
  • Find:
    • A high-confidence assignment to a species in the
      DB
    • UNKNOWN, if confidence is not high enough
  • Use additional evidence/methods to resolve
    UNKNOWN assignments and possibly discover new
    species

5
Existing approaches and limitations
  • Neighbor-joining tree for new and known barcodes
    (Meyer & Paulay 05)
    • One barcode per species
    • Runtime does not scale well with the number of
      species (quadratic or worse)
  • Likelihood ratio test for species membership
    using MCMC (Matz & Nielsen 06)
    • Impractical runtime even for a moderate number
      of species
  • Distance-based: BOLD-IDS, TaxI (Steinke et al. 05)
    • Unclear statistical significance

6
BOLD
  • BOLD: The Barcode of Life Data Systems
    (Ratnasingham & Hebert 07)
  • http://www.barcodinglife.org
  • Currently 28,129 species, 251,429 barcodes
  • Identification system: BOLD-IDS
    • Distance-based (NJ tree for visualization)
    • Employs a threshold (less than 1% divergence) to
      get a tight match to a barcode in the DB

7
BOLD-IDS
  • Ekrem et al. 07: "identifications by the BOLD
    facility must be cautiously evaluated as the
    system at present may return high probabilities
    of placements that obviously are erroneous"

8
Outline
  • Existing approaches to species identification
  • Proposed statistical model-based methods
  • Experimental Results
  • Ongoing Work and Conclusions

9
Bayesian approach to species identification
  • Assign barcode $x = x_1 x_2 x_3 \ldots x_n$ to the species $SP_i$ that
    maximizes $P(SP_i \mid x)$ over all species $SP_i$
  • $P(SP_i \mid x)$ is computed using Bayes' theorem:
    $P(SP \mid x) = P(x \mid SP)\,P(SP) / P(x)$
    • Uniform prior $P(SP)$
    • $P(x)$ is constant for fixed $x$
  • Need a model for $P(x \mid SP)$
  • We explored three scalable models: position
    weight matrices, Markov chains, and hidden Markov
    models
  • Similar to models used successfully in other
    sequence analysis problems such as DNA motif
    finding and protein families
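
A minimal Python sketch of the decision rule on this slide, assuming the per-species log-likelihoods log P(x|SP) have already been computed by one of the three models. The function name `posterior` and the example numbers are hypothetical, not taken from the talk.

```python
import math

def posterior(log_lik):
    """Turn {species: log P(x|SP)} into {species: P(SP|x)} under a uniform prior."""
    # With a uniform prior, P(SP|x) is proportional to P(x|SP).
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(log_lik.values())
    weights = {sp: math.exp(ll - m) for sp, ll in log_lik.items()}
    total = sum(weights.values())  # plays the role of P(x), up to a constant
    return {sp: w / total for sp, w in weights.items()}

# Hypothetical log-likelihoods of one barcode under three species models.
log_lik = {"SP1": -12.0, "SP2": -15.5, "SP3": -30.1}
post = posterior(log_lik)
best = max(post, key=post.get)  # species with the highest posterior
```

Because the prior is uniform and P(x) is shared by all species, picking the largest posterior is the same as picking the largest likelihood.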

10
Positional weight matrix (PWM)
  • Assumption: independence of loci
  • $P(x \mid SP) = P(x_1 \mid SP)\,P(x_2 \mid SP) \cdots P(x_n \mid SP)$
  • For each locus, $P(x_i \mid SP)$ is estimated from the
    frequency of each nucleotide at that
    locus in DB sequences from species SP
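
A minimal sketch of a PWM species model in Python, assuming equal-length aligned barcodes over the alphabet A, C, G, T. The pseudocount is an assumption added here to avoid zero probabilities; the slides do not say how unseen nucleotides are handled. Function names are hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACGT"

def train_pwm(seqs, pseudo=1.0):
    """Per-locus nucleotide probabilities from aligned sequences of one species."""
    pwm = []
    for i in range(len(seqs[0])):
        counts = Counter(s[i] for s in seqs)
        total = len(seqs) + pseudo * len(ALPHABET)
        # Pseudocount (assumption): keeps unseen nucleotides at nonzero probability.
        pwm.append({a: (counts[a] + pseudo) / total for a in ALPHABET})
    return pwm

def pwm_log_lik(pwm, x):
    """log P(x|SP) as a sum of independent per-locus log-probabilities."""
    return sum(math.log(col[a]) for col, a in zip(pwm, x))

toy_db = ["ACGT", "ACGT", "ACGA"]  # toy barcodes for a single species
pwm = train_pwm(toy_db)
```

Working in log space turns the product over loci into a sum, which avoids underflow for barcodes hundreds of bases long.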

11
Inhomogeneous Markov Chain (IMC)
[Figure: chain diagram with a start state and nucleotide states (A, C, G, T) arranged by locus, locus 1 to locus 4]
  • Takes into account dependencies between
    consecutive loci
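
A minimal sketch of a first-order inhomogeneous Markov chain in Python, assuming aligned equal-length barcodes: each locus has its own transition table conditioned on the previous base. The pseudocount and all function names are assumptions for illustration.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"

def train_imc(seqs, pseudo=1.0):
    """Estimate start probabilities and one transition table per locus."""
    k = len(ALPHABET)
    start = {a: (sum(s[0] == a for s in seqs) + pseudo) / (len(seqs) + pseudo * k)
             for a in ALPHABET}
    trans = []  # trans[i-1][(b, a)] = P(base a at position i | base b at position i-1)
    for i in range(1, len(seqs[0])):
        pair = defaultdict(float)
        prev = defaultdict(float)
        for s in seqs:
            pair[(s[i - 1], s[i])] += 1
            prev[s[i - 1]] += 1
        trans.append({(b, a): (pair[(b, a)] + pseudo) / (prev[b] + pseudo * k)
                      for b in ALPHABET for a in ALPHABET})
    return start, trans

def imc_log_lik(model, x):
    """log P(x|SP): start term plus one locus-specific transition term per position."""
    start, trans = model
    ll = math.log(start[x[0]])
    for i in range(1, len(x)):
        ll += math.log(trans[i - 1][(x[i - 1], x[i])])
    return ll

model = train_imc(["ACGT", "ACGT", "ACGA"])
```

Unlike the PWM, the probability of a base here depends on the base at the preceding locus, which is the dependency the slide refers to.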

12
Hidden Markov Model (HMM)
  • Same structure as the IMC
  • Each state emits the associated DNA base with
    high probability but can also emit the other
    bases with probability equal to mutation rate
  • Barcode x is generated along path p with probability
    equal to the product of emission and transition
    probabilities along p
  • $P(x \mid HMM)$ = sum of probabilities over all paths
  • Efficiently computed by the forward algorithm
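
A minimal sketch of the forward algorithm for an HMM with the IMC structure, in Python. It assumes one state per (locus, base) and, as one reading of the slide, that the total mutation rate `mu` is split evenly over the three non-matching bases. The toy model and names are hypothetical.

```python
ALPHABET = "ACGT"

def emission(state_base, observed, mu):
    # Assumption: a state emits its own base with probability 1 - mu
    # and each of the three other bases with probability mu / 3.
    return 1.0 - mu if state_base == observed else mu / 3.0

def forward(start, trans, x, mu=0.01):
    """P(x|HMM): sum over all state paths, accumulated locus by locus."""
    f = {a: start[a] * emission(a, x[0], mu) for a in ALPHABET}
    for i in range(1, len(x)):
        f = {a: emission(a, x[i], mu)
                * sum(f[b] * trans[i - 1][(b, a)] for b in ALPHABET)
             for a in ALPHABET}
    return sum(f.values())

# Toy two-locus model: the chain always starts in A and always moves to C.
start = {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0}
trans = [{(b, a): (1.0 if a == "C" else 0.0) for b in ALPHABET for a in ALPHABET}]
p_ac = forward(start, trans, "AC", mu=0.03)  # 0.97 * 0.97
```

For full-length barcodes the raw probabilities can underflow, so a practical version would rescale `f` at each locus or work with log-probabilities.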

13
Accuracy on BOLD dataset
Accuracy (%) by number of test barcodes removed

Method     10     20     30     40     50
PWM     90.08  90.01  90.02  89.68  89.69
IMC     99.97  99.93  99.90  99.91  99.89
HMM     99.57  99.57  99.66  99.70  99.76

  • 37 species with at least 100 barcodes from BOLD
  • 10-50 barcodes removed and used for test
  • IMC yields better accuracy in all cases

14
Score normalization
  • DB barcodes have non-uniform lengths and cover
    different regions of the COI gene
    • Membership probabilities are not always comparable
  • Normalization scheme:
    • Species models constructed only over positions
      covered in the DB
    • Scores normalized using a background IMC
      constructed from all sequences in the DB
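
The slides do not give the normalization formula. One common choice, shown here only as an assumption, is a log-odds score against the background chain, summed over the set $C$ of positions covered by both the barcode and the species model. The symbols $C$, $P_{SP}$ and $P_{BG}$ are notation introduced for this illustration.

```latex
s(x) = \sum_{i \in C} \left[ \log P_{SP}(x_i \mid x_{i-1}) - \log P_{BG}(x_i \mid x_{i-1}) \right]
```

Under this reading, a barcode that covers fewer positions contributes fewer terms to both the species term and the background term, so scores of barcodes with different coverage become more comparable.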

15
Computing the confidence of assignment
  • x is assigned to species SP with score s
  • p-value: probability that a barcode generated
    under the background model has a score $s' \ge s$
  • Methods for p-value estimation:
    • Random sampling
      • Generate random sequences and count how many
        score at least s
    • Exact computation (for PWMs)
      • Dynamic programming (Rahmann 03)
      • Branch and bound (Zhang et al. 07)
      • Shifted FFTs (Nagarajan et al. 05)
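
A minimal sketch of the random-sampling estimate in Python. The scoring function and background sampler below are toy stand-ins (uniform background, score equal to the number of A's), not the models from the talk, and the add-one correction is an assumption.

```python
import random

def sample_pvalue(score_fn, sample_fn, s, trials=10000, seed=0):
    """Estimate P(score(X) >= s) for X drawn from the background model."""
    rng = random.Random(seed)
    hits = sum(score_fn(sample_fn(rng)) >= s for _ in range(trials))
    # Add-one correction (assumption): keeps the estimate away from exactly zero.
    return (hits + 1) / (trials + 1)

# Toy stand-ins (hypothetical): uniform background, score = number of A's.
def sample_fn(rng, n=10):
    return "".join(rng.choice("ACGT") for _ in range(n))

def score_fn(x):
    return x.count("A")

p_hat = sample_pvalue(score_fn, sample_fn, s=6)
```

The resolution of this estimate is limited by the number of trials, which is one motivation for the exact methods listed above.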

16
Exact computation for PWMs (Rahmann 03)
  • Computes the entire score distribution
  • Scores are rounded using a granularity factor
  • The score is a sum of n independent variables (the score
    contribution of each position)
  • The probability that a random sequence of length i has a
    given score is computed from the contribution of the
    first i-1 positions and the current position
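
A minimal sketch of this dynamic program in Python, assuming an i.i.d. background and scores rounded to integer multiples of a granularity. It follows the idea described on the slide rather than reproducing the cited algorithm in detail; names and the toy example are hypothetical.

```python
from collections import defaultdict

def pwm_score_distribution(scores, background, granularity=0.1):
    """Distribution of the rounded total score of a random background sequence.

    scores[i][a]  : score contribution of letter a at position i
    background[a] : probability of letter a (i.i.d. background, an assumption)
    """
    dist = {0: 1.0}  # before any position: score 0 with probability 1
    for col in scores:
        nxt = defaultdict(float)
        for s, p in dist.items():
            for a, q in background.items():
                # Scores are stored as integer multiples of the granularity.
                nxt[s + round(col[a] / granularity)] += p * q
        dist = dict(nxt)
    return dist

def pvalue(dist, s, granularity=0.1):
    """P(score >= s) read off the exact distribution."""
    t = round(s / granularity)
    return sum(p for k, p in dist.items() if k >= t)

# Toy example: three positions, each scoring 1.0 for A and 0.0 otherwise.
col = {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0}
bg = {a: 0.25 for a in "ACGT"}
dist = pwm_score_distribution([col] * 3, bg, granularity=0.5)
p = pvalue(dist, 2.0, granularity=0.5)  # P(at least two A's in three positions)
```

The table size grows with the range of rounded scores, so a coarser granularity trades precision for speed.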

17
Exact computation for IMCs
  • Define a table entry as the probability that a random
    sequence of length i has a given score and a given last letter
  • Basic recurrence: [equation not preserved]

18
IMC exact p-value computation
  • Initially: [equation not preserved]
  • The probability of a random barcode having a given
    score: [equation not preserved]
  • Runtime: [expression not preserved], where R is the
    difference between the max and min score for any i
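
A minimal Python sketch of a dynamic program consistent with the definition given for IMCs (probability that a random sequence of length i has a given score and last letter). It is one possible reading, not necessarily the authors' exact recurrence; it assumes integer-rounded scores and a background that is itself a Markov chain. All names are hypothetical.

```python
from collections import defaultdict

ALPHABET = "ACGT"

def imc_score_distribution(start_score, step_score, bg_start, bg_trans):
    """Exact distribution of integer-rounded IMC scores under a background chain.

    start_score[a]        : rounded score of letter a at the first position
    step_score[i][(b, a)] : rounded score of letter a following b at step i
    bg_start, bg_trans    : background chain probabilities, same indexing
    """
    # table[(s, a)] = P(prefix has score s and ends in letter a)
    table = defaultdict(float)
    for a in ALPHABET:
        table[(start_score[a], a)] += bg_start[a]
    for sc, tr in zip(step_score, bg_trans):
        nxt = defaultdict(float)
        for (s, b), p in table.items():
            for a in ALPHABET:
                nxt[(s + sc[(b, a)], a)] += p * tr[(b, a)]
        table = nxt
    # Marginalize out the last letter to get P(score = s).
    dist = defaultdict(float)
    for (s, _), p in table.items():
        dist[s] += p
    return dict(dist)

# Toy example: uniform background, score 1 whenever a letter repeats the previous one.
uniform = {(b, a): 0.25 for b in ALPHABET for a in ALPHABET}
repeat = {(b, a): int(a == b) for b in ALPHABET for a in ALPHABET}
dist = imc_score_distribution({a: 0 for a in ALPHABET}, [repeat, repeat],
                              {a: 0.25 for a in ALPHABET}, [uniform, uniform])
```

The last letter has to be part of the table key because, unlike in the PWM case, the score contribution of a position depends on the preceding base.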

19
Outline
  • Existing approaches to species identification
  • Proposed statistical model-based methods
  • Experimental Results
  • Ongoing Work and Conclusions

20
Experimental setup (1)
  • Compared methods
    • IMC
      • Species with the highest score
      • If score < species-specific threshold → UNKNOWN
    • Distance-based (BOLD-IDS like)
      • Species containing the barcode showing the least
        divergence
      • If divergence > threshold (default 1%) → UNKNOWN
  • Basic questions
    • What is the effect of training set size
      (barcodes per species) on accuracy?
    • What is the effect of the number of species on accuracy?
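
A minimal sketch of the distance-based baseline in Python, assuming aligned equal-length barcodes and using the plain fraction of differing sites as the divergence. The actual distance measure used by BOLD-IDS is not stated on the slides, so this choice and the function names are assumptions.

```python
def p_distance(x, y):
    """Fraction of differing sites between two equal-length aligned barcodes."""
    return sum(a != b for a, b in zip(x, y)) / len(x)

def distance_assign(db, x, threshold=0.01):
    """db: list of (species, barcode) pairs. Returns a species name or 'UNKNOWN'."""
    # Pick the DB barcode with the least divergence from x.
    species, barcode = min(db, key=lambda rec: p_distance(rec[1], x))
    # Assign only if the closest match is within the threshold (default 1%).
    return species if p_distance(barcode, x) <= threshold else "UNKNOWN"

# Toy database with two species (hypothetical sequences).
db = [("sp1", "A" * 100), ("sp2", "C" * 100)]
exact = distance_assign(db, "A" * 100)
far = distance_assign(db, "A" * 50 + "G" * 50)
```

The IMC method has the same shape: take the best-scoring species, then fall back to UNKNOWN when the score does not clear that species' threshold.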

21
Experimental setup (2)
  • Two scenarios
    • Complete DB: all new barcodes belong to species
      in the DB
    • Incomplete DB: some new barcodes belong to
      species not in the DB

22
Accuracy measures
  • True positive rate: TP/(TP+FP)
  • Barcodes belonging to species present in the DB
    • TP: barcodes assigned to the correct species
    • FP: barcodes assigned to an incorrect species
  • Barcodes belonging to species not present in the DB
    • TP: barcodes assigned to UNKNOWN
    • FP: barcodes assigned to a species in the DB
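
A minimal sketch of this accuracy measure in Python. It assumes that a barcode from a species in the DB that is labeled UNKNOWN counts as neither TP nor FP, which is one reading of the definitions above; the function name and toy records are hypothetical.

```python
def true_positive_rate(records, db_species):
    """records: iterable of (true_species, assigned) pairs; assigned may be 'UNKNOWN'."""
    tp = fp = 0
    for truth, assigned in records:
        if truth in db_species:
            if assigned == truth:
                tp += 1
            elif assigned != "UNKNOWN":
                fp += 1
            # Assumption: UNKNOWN for a species in the DB is neither TP nor FP.
        elif assigned == "UNKNOWN":
            tp += 1  # species absent from the DB, correctly left unassigned
        else:
            fp += 1  # species absent from the DB, wrongly assigned
    return tp / (tp + fp) if tp + fp else float("nan")

records = [("a", "a"), ("a", "b"), ("a", "UNKNOWN"), ("z", "UNKNOWN"), ("z", "a")]
rate = true_positive_rate(records, db_species={"a", "b"})
```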

23
Effect of barcodes/species
  • Datasets containing all BOLD species with at
    least 5/25 barcodes
    • BOLD5: 1508 species, 28600 barcodes
    • BOLD25: 270 species, 17197 barcodes
  • DB composed of randomly picked 5-20 barcodes from
    all species in BOLD25
  • Test barcodes
    • Complete database scenario
      • All remaining barcodes from BOLD25
    • Incomplete database scenario
      • All barcodes from BOLD5 not in the DB

24

Effect of barcodes/species, complete DB

25
Effect of barcodes/species, incomplete DB
26
Effect of number of species
  • Datasets containing all BOLD species with at
    least 5/10 barcodes
    • BOLD5: 1508 species, 28600 barcodes
    • BOLD10: 690 species, 23558 barcodes
  • DB composed of randomly picked 100 to 690 species
    from BOLD10
    • 10 barcodes per species
  • Test barcodes
    • Complete database scenario
      • All remaining barcodes from picked species
    • Incomplete database scenario
      • All barcodes from BOLD5 not in the DB

27
Effect of number of species, complete DB
28
Effect of number of species, incomplete DB
29
Outline
  • Existing approaches to species identification
  • Proposed statistical model-based methods
  • Experimental Results
  • Ongoing Work and Conclusions

30
Conclusions & Ongoing work
  • IMC provides a scalable method for species
    identification
    • High accuracy, with a useful tradeoff between TP
      rate and unknown rate
    • Efficiently computable p-values
  • Comprehensive comparison of identification
    algorithms to be submitted to the 2nd International
    Barcode Conference
    • Broad coverage of methods
      • Tree-based, distance-based, character-based,
        model-based
    • Assessment of further effects besides number of
      species and barcodes/species
      • Barcode length
      • Barcode quality
      • Number of regions
    • Runtime scalability (up to millions of species)
    • Diverse datasets (BOLD, cowries, flu viruses,
      simulated data, etc.)