Pick a bug, any bug. - PowerPoint PPT Presentation

About This Presentation
Title:

Pick a bug, any bug.

Description:

... .00 3009.29 4.02 16.23 2.00 4.71 1.00 1.00 2.00 2.00 1.00 2.00 3.33 1.00 1.00 2.00 2.00 1.00 2.00 26.24 2.00 10.05 2.00 23766.62 14.54 2.00 4.25 4.94 1.00 1.00 ... – PowerPoint PPT presentation

Number of Views:36
Avg rating:3.0/5.0
Slides: 52
Provided by: ToddDe5
Category:
Tags: bug | pick

less

Transcript and Presenter's Notes

Title: Pick a bug, any bug.


1
Pick a bug, any bug.
Comprehensive Aligned Sequence Construction for
Automated Design of Effective Probes (CASCADE-P)
  • The design and practical use of rDNA
    oligonucleotide microarrays to identify microbes
    in complex samples
  • Todd DeSantis, Igor Dubosarskiy (Ceres), Sonya
    Murray, Gary Andersen
  • Environmental Molecular Microbiology Group
  • BBRP - LLNL

2
The worries of an Italian parent
Is he getting enough sleep?
Does he feel accepted?
Is there sufficient diversity in his lower G.I.
bacterial community?
Are the straps on his car seat irritating his
neck?
Are there archaeal organisms are in the aerosols
he breathes?
Enzo Salvatore DeSantis
Jenny DiGiovanni DeSantis
3
  • Every soiled diaper sacrificed to the Diaper
    Genie is lost data.
  • But who wants to do all the work?
  • Culture
  • anaerobes
  • non-cultivable
  • Sequencing 16S rDNA
  • Need to create, clone, process hundreds of
    samples
  • Can we create a simple, quantitative,
    comprehensive microbial test?

4
Outline
  • Goals
  • Experimental approach
  • Why create a new 16S rDNA database?
  • How do you align gt65,000 sequences?
  • Organization of sequences into types
  • Designing probes for each type
  • Reassessing probe specificity as database grows
  • Using 16S GeneChip for quantitative aerosol
    analysis

5
Project Overview
  • Create a single GeneChip capable of detecting
    and quantifying bacterial and/or archaeal
    organisms in a complex sample.
  • What is in a sample, as opposed to what is not in
    a sample.
  • Approach
  • Combinatorial power of multiple probes for
    sequence-specific hybridization

6
General Protocol
Sample Air Soil Feces Blood
Random hexamer total gDNA amplification Allen
Christian
gDNA
Universal 16S rDNA PCR
rRNA
Contains probes adhered to glass surface in grid
pattern.
7
16S rRNA gene (16S rDNA)
  • Used to identify and classify organisms by gene
    sequence variations.
  • Variations have been used in design of DNA probes
    for the detection of
  • taxonomic domains, divisions, groups
  • specific organisms

8
The Ribosome
rDNA
rRNA (functional molecule)
LSU
SSU 16s or 18s
9
The Ribosome
  • Folded secondary structure
  • Essential functional component
  • Conserved spans
  • structure must be retained for viability
  • targeted for universal/group-specific PCR primers
    and probes
  • Variable regions
  • spans not fundamental to the folded structure
  • receive less pressure from natural selection
  • probed for genus and species level discrimination

10
Building upon two decades of 16S gene cataloging
  • Over 75,000 16S records housed at NCBI.
  • Grows every week.
  • Dont need to build a reference library for
    organism ID.
  • FAME
  • MS
  • Surface Antigen Identification

11
What could be amplified?
  • Universal 16S PCR primers ? complex population of
    amplicons.
  • Must define the targets to consider as the
    Potential Amplicon Set or PAS.
  • Give me a file of all the amplicon sequences
    possible if we used universal primers upon a
    sample containing every prokaryote.

Tom Kusmarski
Variable
12
Difficulties defining the PAS
  • Why Entrez search for 16S is insufficient
  • non-16S sequences are errantly deposited as 16S
  • 16S sequences may be concealed within longer
    records that cover entire operons or genomes
  • anti-sense strands aren't specified as such

13
Difficulties defining the PAS
  • For each sequence, need to trim away bases that
    wont be amplified.
  • Problem Most 16S records are partial.
  • Can primer pattern matching within sequences
    allow for proper trimming?
  • What does it mean when a primer search fails?
  • Primer locus present in record but mutated, or
  • Primer locus outside the sequence span deposited

14
Difficulties defining the PAS
  • Aligned sequences were necessary.
  • A 16S MSA arranged as horizontal rows of
    characters allows vertical slices to be extracted
    between columns of primer annealing positions.

15
Existing 16S aligned databases
  • Under 20,000 sequences among 3 databases.
  • Updates occurred annually or worse.
  • Others focused on hand-aligning complete
    sequences.
  • Structure predictions
  • Phylogeny assessment
  • Comprehensive Aligned Sequence Construction for
    Automated Design of Effective Probes (CASCADE-P)
  • need up-to-date records
  • need to include partial sequences
  • add inertia to region considered conserved
  • increase the likelihood of detecting a
    polymorphism
  • searched for unwanted cross-hybridizations with a
    tentative probe

16
RDP alignment and tree a great skeleton
  • Ribosomal Database Project (Michigan State)
  • 16S MSA of 16 277 seqs (v8.1)
  • trimmed of extra-16S
  • top-strand oriented
  • 1,541 bases stretched to 4,218 characters
  • Each placed within a hierarchical phylogenetic
    tree

17
Sequence pre-processing
  • Download 16S candidates via ESearch
  • BLAST compare to 16S/18S RDP standards.
  • Candidates were rejected if
  • the longest match length was lt300 base pairs
  • the highest scoring BLAST subject was eukaryotic
  • candidate matched sequences in two or more RDP
    terminal tree branches equally well
  • Phylocode assigned from top HSP

18
Sequence pre-processing
19
Sequence pre-processing
  • Candidate trimmed of extra-16S seq data
  • tRNA genes, intergenic spacer regions, and 23S
    rDNA
  • based on HSP boundries
  • If HSP paired opposite strands, candidate was
    reverse complemented.

20
Sequence pre-processing
21
Sequence pre-processing
  • The "template" was assigned from the top HSP from
    a second BLAST process
  • G1, E1.
  • Favors longer, but less identical matches.

22
Prokaryotic Multiple Sequence Alignment prokMSA
  • Essentially, the prokMSA was a merger built
    serially by aligning each candidate to its
    closest relative in the RDP tree.
  • Two Steps
  • Align0
  • Publicly available, pair-wise SW aligner
  • Candidate expansion
  • NAST
  • Nearest Alignment Space Termination
  • Novel algorithm
  • Candidate compression

23
NAST
  • DEFINE
  • St post-Align0 template sequence.
  • Sc post-Align0 candidate sequence.
  • Ht alignment space (hyphen) inserted into St by
    Align0.
  • Hc alignment space (hyphen) inserted into Sc by
    Align0.
  •  
  • WHILE (St contains one or more Ht) DO
  • LHt character index of distal 5' Ht within St
  • L5' character index of Hc within Sc which is
    5' proximal to Ht
  • L3' character index of Hc within Sc which is
    3' proximal to Ht
  • IF ((LHt L5') gt (L3' LHt)) Delete Hc found
    at L3'
  • ELSE Delete Hc found at L5'
  • Delete template gap character.
  • END WHILE

24
December 28th, 2002
25
Operational Taxonomic Units
  • We did not desire to design probes for each
    sequence.
  • Many sequences are nearly identical.
  • Desired ?20 probe per sequence
  • 60,000 seqs ( 20 probes 20 probes)
  • 2.4 million probes (not possible, yet)
  • We did desire to design probes for each type of
    sequence.
  • Need to group sequences into types amenable to
    probe design.

26
Operational Taxonomic Units
  • Avoid groupings based on historical nomenclature.
  • Sequence-dependent classification by transitive
    similarity clustering at 98.
  • Create groupings into Operational Taxonomic Units
    (OTU).
  • Each sequence must be in exactly 1 OTU

if x R y y R z ? x R z
27
Sequences Clustered
BACTERIA (2) GRAM_POSITIVE_BACTERIA (2.30)
BACILLUS-LACTO-STREPTOCOC_SUBDIVISION (2.30.7)
CARNOBACTERIUM_GROUP (2.30.7.18)
CRN.DIVERGENS_SUBGROUP (2.30.7.18.2) OTU
2.30.7.18.2.012 (11 sequence records)
AF244371 Nostocoida limicola I Ben200
AF244372 Nostocoida limicola I Ben201
AF244375 Nostocoida limicola I Ben77
AF255736 Nostocoida limicola I AF276462
Uncultured bacterium clone RFLP 102E
AF394926 Lactosphaera sp. PMagG1 AJ296179
Ruminococcus palustris DSM 9172T AJ306612
Trichococcus collinsii 37AN3 L76599
Lactosphaera pasteurii ATCC 35945 X87150
Lactosphaera pasteurii DSM 2381 Y17301
Trichococcus flocculiformis DSM 2094
  • Is IB ? Lm / min (La, Lb)

28
Some OTUs contain hundreds of sequences. Example
Many isolates of a human pathogen.
Some species are found in over 20 OTUs.
  • Bioinformatics manuscript input
  • Sonya Murray
  • Peter Agron
  • Sadhana Chauhan

29
http//greengenes.llnl.gov/16S
  • Comprehensive Aligned Sequence Construction for
    Automated Design of Effective Probes
  • Igor Dubosarskiy
  • Java implementations
  • Tim Harsch
  • RDBMS consultations
  • Lisa Corsetti
  • Apache module management
  • Kevin Melissare
  • Graphics

30
My Interest List
  • Able to define a region of the tree that is
    important to you.
  • Your list will be remembered between visits.

31
Picking Probes for GeneChip Microarray
1492R
pA
27
1507
Slice taken from prokMSA
  • Select 20 to 28 probe pairs for each of 8,432
    OTUs
  • Ideal Perfect Match Probe
  • 25mer
  • Present in all sequences of the OTU
  • Not present outside the OTU
  • Unable to X-hybe with seqs in other OTUs
  • Ideal Mis-match Control Probe
  • Unable to X-hybe within entire PAS

32
Scoring Probe Candidates
  • Score is calculated for each potential probe
    pair.
  • Product of 3 factors
  • Locus Specific Prevalence Factor
  • PerfectMatch X-hybe Factor
  • The closer the tree distance, the lower the
    factor
  • MisMatch X-hybe Factor
  • 3 bases available
  • use base which produces highest factor

33
OTU composed of 26 sequences
Locus Specific Prevalence Scoring
34
Cross Hybridization
Central Data Structure 17mer_hashAGCTATTATAGCTG
CAG2.30.7.12.4.004 1 17mer_hashAGCTATTATA
GCTGCAG2.30.7.12.4.009 1 17mer_hashAGCTAT
TATAGCTGCAG2.30.7.12.3.001 1 1.2
Gb Allowed rapid lookup of all OTUs containing a
particular 17mer
Consultations Mark Wagner Tom Slezak Tom
Kuczmarski Mike Mittman, Affymetrix
35
OTU-2.30.7.18.2.002
Sample Probe Rankings
RANK
36
Combinatorial scoring of Probe Sets are able to
categorize mixed samples.
S. aureus spike
Art Koybayashi Simulation package
B. anthracis spike
37
Combinatorial scoring of Probe Sets are able to
categorize mixed samples.
Hybridization results from spike-in experiment
done in triplicate. Sonya Murray Aubree Hubbel
Percent of probe-pairs scored positive for each
probe set in the Staphylococcus Group.
38
PAS is a moving target
  • Problems
  • New 16S sequence data is constantly being
    deposited to public databases.
  • Mismatch Probes can become Perfect Matches.
  • Phylogenetic groupings (OTUs) can change.
  • New transitive links may be discovered

39
Dynamic probe associations
original (static)
dynamic
Name2.30.7.12.004 BlockNumber1 NumAtoms33 NumCe
lls66 CellHeaderX Y PROBE FEAT
QUAL Cell16 59
TCAAACATTGCGGGCTTCAG 2.30.7.12.004
Cell26 58 TCAAACATTGTGGGCTTCAG
2.30.7.12.004 Cell3171 202
AAACATTGTGGGCTTCAGCC 2.30.7.12.004
Cell4171 203 AAACATTGTGTGCTTCAGCC
2.30.7.12.004 Cell5197 163
CATTGTGGGCATCAGCCACC 2.30.7.12.004
Cell6197 162 CATTGTGGGCTTCAGCCACC
2.30.7.12.004 Cell7151 175
ATTGTGGGCTGCAGCCACCC 2.30.7.12.004
Cell8151 174 ATTGTGGGCTTCAGCCACCC
2.30.7.12.004 Cell9228 2
TTGTGGGCTTCAGCCACCCC 2.30.7.12.004
Cell10228 3 TTGTGGGCTTTAGCCACCCC
2.30.7.12.004 Cell11139 22
TGTGGGCTTCAGCCACCCCA 2.30.7.12.004
Cell12139 23 TGTGGGCTTCGGCCACCCCA
2.30.7.12.004 Cell1394 76
GGGCTTCAGCCACCCCATTG 2.30.7.12.004
Cell1494 77 GGGCTTCAGCTACCCCATTG
2.30.7.12.004 Cell15111 118
CTTCAGCCACCCCATTGGAA 2.30.7.12.004
  • Ability to re-map chip to up-to-date 16S data
  • Many of the existing probes on the physical array
    are complementary to new sequences.
  • Probes originally deemed MM can become PM to some
    organisms.
  • mySQL database used for association maintenance

40
Finding groupings
probes
seq 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
A
B
C
D
E
F
G
H
I
J
K
L
M
N
O
sequences
Consider A O to be 16S sequences. Consider 1
24 to be probes already embedded on the
chip. First, associate all available probes with
all available sequences. Let probe similarities
drive sequence groupings.
41
Finding groupings
seq 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
A
B
C
D
E
F
G
H
I
J
K
L
M
N
O
Consider A O to be 16S sequences. Consider 1
24 to be probes already embedded on the
chip. First, associate all available probes with
all available sequences. Let probe similarities
drive sequence groupings.
42
Finding groupings
seq 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
A
B
C
D
E
F
G
H
I
J
K
L
M
N
O
Consider A O to be 16S sequences. Consider 1
24 to be probes already embedded on the
chip. First, associate all available probes with
all available sequences. Let probe similarities
drive sequence groupings.
43
Finding groupings
seq 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
A
B
C
D
E
F
G
H
I
J
K
L
M
N
O
Consider A O to be 16S sequences. Consider 1
24 to be probes already embedded on the
chip. First, associate all available probes with
all available sequences. Let probe similarities
drive sequence groupings.
44
Finding groupings
seq 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
A
B
C
D
E
F
G
H
I
J
K
L
M
N
O
Consider A O to be 16S sequences. Consider 1
24 to be probes already embedded on the
chip. First, associate all available probes with
all available sequences. Let probe similarities
drive sequence groupings.
45
Progressive Cyclical Grouping
  • DEFINE uGBpp as the number of useful probe pairs
    which globally differentiate a cluster from all
    other sequences.
  • FOR uGBpplock (11 .. 4) DO
  • FOR uPWppsep (1 .. 10) DO
  • Determine uGBppclust for each cluster.
  • Lock all clusters where uGBppclust uGBpplock
    .
  • Pair-wise (PW) compare each non-locked cluster
    (clustA, clustB)
  • uPWppclustA number of useful probe pairs
    which PW differentiate clustA from clustB
  • uPWppclustB number of useful probe pairs
    which PW differentiate clustB from clustA
  • Merge sequences of clustA and clustB into one
    cluster unless uPWppclustA uPWppsep AND
    uPWppclustB uPWppsep
  • END FOR
  • Separate all sequences in non-locked clusters so
    that each sequence is the sole element of its own
    cluster.
  • END FOR

46
Quantitative Analysis
  • Could the concentration of each amplicon in a
    sample be measured by fluorescence intensity?
  • Experimental setup for 20 point calibration

SPIKE CONCENTRATION (pM in Hybridization
Solution)
Experiment Lc.oenos Fer.nod Sap.grand M.neuro H20 16S amplicons
1 5 13 31 74 No Yes
2 13 31 74 143 No Yes
3 31 74 143 5 No Yes
4 74 143 5 13 No Yes
5 143 5 13 31 No Yes
6 0 0 0 0 Yes Yes
Sonya Murray Carol Stone
18uL of products from 30 cycle universal 16S
PCR of gDNA extracted from U.K. air sample.
47
Quantitative Analysis
Log2 transformed Linear Least Squares Regression
Pearsons corr coeff was significant (df18) 95
confidence intervals calculated according to
National Measurement System Valid Analytical
Measurement Programme (VAM)
48
Quantitative Analysis
  • Environmental community is measured with
    confidence intervals.

49
Summary
  • prokMSA contains 65,000 aligned sequences and
    growing (largest collection).
  • Over 8,000 distinct OTUs have been found.
  • Global probe-picking was completed.
  • DOE 16S GeneChip was manufactured.
  • Ability to correctly categorize spike-ins is
    being validated.
  • Detected amplicons can be quantified.

50
Acknowledgements
  • Gary Andersen Group Leader (The Tangent
    Terminator)
  • Carol Stone Sample collection, hybridization
    (DSLT)
  • Aubree Hubbel Spike synthesis
  • Sonya Murray - Hybridizations
  • Peter Agron ms advise
  • Sadhana Chauhan ms advice
  • Mike Mittman probe selection constraints
    (Affymetrix)
  • Art Koybayashi Hyb simulation package, ms
    advice
  • Tom Kusmarski - PAS, algorithm optimization
  • Tom Slezak - algorithm optimization
  • Mark Wagner - algorithm optimization
  • Igor Dubosarskiy Java, web front-end (Ceres)
  • Tim Harsch - RDBMS
  • Lisa Corsetti Apache administration
  • Allen Christensen genomic sample amplification
  • This work was performed under the auspices of the
    U.S. Department of Energy by the University of
    California, Lawrence Livermore National
    Laboratory, under contract no. W-7405-Eng-48.

51
(No Transcript)
Write a Comment
User Comments (0)
About PowerShow.com