Title: AMOS Assembly Validation and Visualization
1AMOS Assembly Validation and Visualization
- Michael Schatz
- October 7, 2008
2Lander-Waterman Statistics
- Sequencing as a random process
- $E(\text{contigs}) = N e^{-c\sigma}$
- $E(\text{contig size}) = L\left(\frac{e^{c\sigma} - 1}{c} + (1 - \sigma)\right)$
- $L$ = read length
- $T$ = minimum overlap
- $G$ = genome size
- $N$ = number of reads
- $c$ = coverage ($NL / G$)
- $\sigma = 1 - T/L$
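The Lander-Waterman expectations can be evaluated directly from the four inputs on this slide. The following is a minimal sketch, assuming sigma = 1 - T/L and the standard form of the expected contig size; the function name and the example numbers are illustrative and do not come from the slides.

```python
import math

def lander_waterman(read_len, min_overlap, genome_size, num_reads):
    """Return (coverage, expected contig count, expected contig size in bases)."""
    c = num_reads * read_len / genome_size   # coverage c = NL / G
    sigma = 1 - min_overlap / read_len       # sigma = 1 - T/L
    expected_contigs = num_reads * math.exp(-c * sigma)
    # expm1(x) computes e^x - 1 accurately, even for very low coverage
    expected_size = read_len * (math.expm1(c * sigma) / c + (1 - sigma))
    return c, expected_contigs, expected_size

# Illustrative inputs only: 50,000 reads of 800 bp on a 5 Mbp genome, 40 bp overlap
c, n_contigs, size = lander_waterman(800, 40, 5_000_000, 50_000)
print(f"coverage={c:.1f}x contigs={n_contigs:.1f} contig size={size:.0f} bp")
```

As coverage approaches zero the expected contig size approaches the read length L, which is a quick sanity check on the formula.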
3Assembly Reality
- Contigs are never as large as expected
- High coverage is a necessary but not sufficient condition
- Sequencing is basically random, but sequence composition is not
- Repeats control the quality of the assembly
- Assemblers break contigs at ambiguous repeats
- Highly repetitive genomes will be highly fragmented
- Assemblers make mistakes
- Mis-assemblies confuse all downstream analysis
- Tension between overlap error rate and repeat resolution
4Assembly Evaluation
5Assembly Evaluation
6Genome Assembly Forensics
- Computationally scan an assembly for mis-assemblies.
- Data inconsistencies are indicators for mis-assembly
- Some inconsistencies are merely statistical variations
- amosvalidate
- Load Assembly Data into Bank
- Analyze Mate Pairs Libraries
- Analyze Depth of Coverage
- Analyze Normalized K-mers
- Analyze Read Alignments
- Analyze Read Breakpoints
- Load Mis-assembly Signatures into Bank
AMOS Bank
7Mis-assembly types
Correct
Mis-assembly
Basic mis-assemblies can be combined into more complicated patterns: Insertions, Deletions, Giant Hairballs
81. Analyze Mate Pairs
- Evaluate mate happiness across assembly
- Happy = Correct orientation and distance
- Finds regions with multiple:
- Compressed Mates
- Expanded Mates
- Invalid orientation (same → → or outie ← →)
- Missing Mates
- Linking mates (mate in a different scaffold)
- Singleton mates (mate is not in any contig)
9Mate-Happiness Deletion
- Deletion: Excise reads between collapsed repeats
- Truth
- Misassembly: Compressed Mates, Linking Mates
10Mate-Happiness Insertion
- Insertion: Additional reads between flanking repeats
- Truth
- Misassembly: Expanded Mates, Missing Mates
11Mate-Happiness Inversion
- Inversion: Flipping of reads
- Truth
- Misassembly: Misoriented Mates
12Mate-Happiness Rearrangement
- Rearrangement: Reordering of reads
- Truth
- Misassembly: Misoriented, Compressed, Stretched Mates
13C/E Statistic
- Easy to detect mis-oriented or missing mates, but individual compressed or expanded mates are expected.
- Does the distribution of inserts spanning a given position differ from the rest of the library?
- Record large differences as potential misassemblies
- Even if each individual mate is happy
- Compute the statistic at all positions
- $\text{C/E} = \frac{\text{Local Mean} - \text{Global Mean}}{\text{Global Stdev} / \sqrt{N}}$
- > +3 indicates significant expansion
- < -3 indicates significant compression
- Introduced by Jim Yorke's group at UMD
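The statistic above is a z-score of the local mean insert size against the library. The sketch below shows one way to compute it at regular positions along a contig. The function names, the `(start, end)` insert representation, and the `step` parameter are assumptions made for illustration; they are not the AMOS implementation.

```python
import math

def ce_statistic(local_sizes, global_mean, global_sd):
    """(Local Mean - Global Mean) / (Global Stdev / sqrt(N)) for N spanning inserts."""
    n = len(local_sizes)
    if n == 0:
        return None  # no insert spans this position, so the statistic is undefined
    local_mean = sum(local_sizes) / n
    return (local_mean - global_mean) / (global_sd / math.sqrt(n))

def scan_ce(inserts, contig_len, global_mean, global_sd, step=100):
    """Evaluate the statistic every `step` bases.

    inserts: list of (start, end) contig coordinates, end exclusive.
    Returns a list of (position, statistic) pairs.
    """
    results = []
    for pos in range(0, contig_len, step):
        spanning = [end - start for start, end in inserts if start <= pos < end]
        results.append((pos, ce_statistic(spanning, global_mean, global_sd)))
    return results

# Eight inserts whose local mean is 4461 bp, from a library with mean 4000 and stdev 400
print(round(ce_statistic([4461] * 8, 4000, 400), 2))
```

A linear scan over all inserts at every position is quadratic in the worst case; an interval index would be the natural refinement for large assemblies.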
14Sampling the Genome
15C/E-Statistic Expansion
[Figure: inserts drawn against an axis with ticks at 0kb, 2kb, 4kb, 6kb]
8 inserts: 3.2kb-6kb, Local Mean = 4461
$\text{C/E Stat} = \frac{4461 - 4000}{400 / \sqrt{8}} = +3.26$
C/E Stat ≥ 3.0 indicates Expansion
16C/E-Statistic Compression
[Figure: inserts drawn against an axis with ticks at 0kb, 2kb, 4kb, 6kb]
8 inserts: 3.2kb-4.8kb, Local Mean = 3488
$\text{C/E Stat} = \frac{3488 - 4000}{400 / \sqrt{8}} = -3.62$
C/E Stat ≤ -3.0 indicates Compression
172. Read Coverage
- Find regions of contigs where the depth of coverage is unusually high
- Collapsed Repeat Signature
- Can detect collapse of 100% identical repeats
- AMOS Tool: analyzeReadDepth
- 2.5x mean coverage
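A depth-of-coverage scan of this kind can be sketched in a few lines. This is an illustration of the idea only: the slides do not describe how analyzeReadDepth computes its baseline, so the sketch assumes the mean is taken over the whole contig, and the function names and `(start, end)` read layout are hypothetical.

```python
def depth_profile(reads, contig_len):
    """Per-base read depth from (start, end) read placements, end exclusive."""
    diff = [0] * (contig_len + 1)
    for start, end in reads:
        # Clamp placements to the contig, then mark where coverage rises and falls
        diff[min(max(start, 0), contig_len)] += 1
        diff[min(max(end, 0), contig_len)] -= 1
    depth, running = [], 0
    for i in range(contig_len):
        running += diff[i]
        depth.append(running)
    return depth

def high_coverage_regions(depth, factor=2.5):
    """Return (start, end) runs where depth exceeds factor x the contig mean."""
    threshold = factor * sum(depth) / len(depth)
    regions, run_start = [], None
    for pos, d in enumerate(depth):
        if d > threshold and run_start is None:
            run_start = pos
        elif d <= threshold and run_start is not None:
            regions.append((run_start, pos))
            run_start = None
    if run_start is not None:
        regions.append((run_start, len(depth)))
    return regions

# One read across the contig plus five extra reads piled on bases 40-49
reads = [(0, 100)] + [(40, 50)] * 5
print(high_coverage_regions(depth_profile(reads, 100)))
```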
183. Normalized K-mers
- Not all repeats are mis-assembled, but most mis-assemblies occur at repeats.
- Detect repeats by noticing increased read k-mer coverage.
- 2 copy repeat will have 2x the average read k-mer coverage
- N copy repeat will have Nx the average read k-mer coverage
- Normalize read k-mer coverage with consensus k-mer coverage to highlight just the repeats with missing copies.
- The sequence of an N copy repeat should occur N times in the consensus.
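The normalization described above can be illustrated by dividing each k-mer's count in the reads by its count in the consensus. The sketch below makes simplifying assumptions that are not from the slides: it ignores reverse complements and sequencing errors, and the function names are hypothetical.

```python
from collections import Counter

def kmer_counts(seqs, k):
    """Count every k-mer occurring in a collection of sequences."""
    counts = Counter()
    for seq in seqs:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts

def normalized_kmer_coverage(reads, consensus, k):
    """Read k-mer count divided by consensus k-mer count, per consensus position."""
    read_counts = kmer_counts(reads, k)
    cons_counts = kmer_counts([consensus], k)
    profile = []
    for i in range(len(consensus) - k + 1):
        kmer = consensus[i:i + k]
        profile.append(read_counts[kmer] / cons_counts[kmer])
    return profile

# A repeat present twice in the consensus normalizes back to the unique-region level
print(normalized_kmer_coverage(["ACGACG", "ACGACG"], "ACGACG", 3))
```

When a two-copy repeat is collapsed to a single copy in the consensus, the divisor drops to 1 while the read count stays doubled, so the normalized value stands out at roughly twice the level of unique sequence.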
19Normalized K-mers
- Correct assembly of 2 copy repeat
20Normalized K-mers
- Mis-assembly of 2 copy repeat
- Effective even if reads are left as singletons
Normalized K-mers
214. Read Alignment
- Multiple reads with same conflicting base are unlikely
- 1x QV 30: 1/1000 base calling error
- 2x QV 30: 1/1,000,000 base calling error
- 3x QV 30: 1/1,000,000,000 base calling error
- Regions of correlated SNPs are likely to be assembly errors or interesting biological events
- Highly specific metric
- AMOS Tools: analyzeSNPs, clusterSNPs
- Locate regions with high rate of correlated SNPs
- Parameterized thresholds
- Multiple positions within 100bp sliding window
- 2 conflicting reads
- Cumulative QV > 40 (1/10000 base calling error)
[Figure: multiple alignment in which the reads split into two groups that disagree at the same columns (A G C in one group, C T A in the other)]
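The quality-value arithmetic and the sliding-window grouping can be sketched as follows. Two assumptions are made here that the slides do not state outright: base-calling errors in different reads are treated as independent, and "cumulative QV" is taken to be the sum of the QVs of the conflicting bases. The function names are hypothetical and this is not the analyzeSNPs or clusterSNPs code.

```python
import math

def phred_to_error(qv):
    """Base-calling error probability for a Phred quality value."""
    return 10 ** (-qv / 10)

def joint_error(qvs):
    """Probability that every listed base call is wrong, assuming independence."""
    return math.prod(phred_to_error(q) for q in qvs)

def correlated_snp_regions(positions, window=100, min_positions=2):
    """Merge SNP positions into regions holding >= min_positions within `window` bp."""
    positions = sorted(positions)
    regions = []
    j = 0
    for i, start in enumerate(positions):
        j = max(j, i)
        # Extend j to the last SNP that still falls inside the window opened at start
        while j + 1 < len(positions) and positions[j + 1] - start < window:
            j += 1
        if j - i + 1 >= min_positions:
            if regions and start <= regions[-1][1]:
                regions[-1] = (regions[-1][0], max(regions[-1][1], positions[j]))
            else:
                regions.append((start, positions[j]))
    return regions

print(joint_error([30, 30]))  # two QV 30 reads agreeing on the same wrong base
print(correlated_snp_regions([10, 50, 90, 500, 900, 950]))
```

Because Phred values are logarithmic, multiplying the error probabilities is the same as adding the QVs, which is why a summed QV threshold is a convenient way to express the cutoff.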
225. Read Breakpoints
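- Align singleton reads to consensus sequences.
- A consistent breakpoint shared by multiple reads can indicate a collapsed repeat.
- Initially developed to detect collapsed repeat in Bacillus anthracis.
[Figure: singleton reads BAPDN53TF (786bp), BAPDF83TF (786bp), BAPCM37TR (697bp) and BAPBW17TR (1049bp) aligned against the consensus near the 16S rRNA gene and the repeat copies RA and RB. Consensus coordinates labeled: 144203, 144337, 145515, 146226, 146944, 147021. Other numeric labels: 375, 665, 428, 668.]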
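Finding a breakpoint that several reads share amounts to clustering the consensus positions where partial alignments end. The sketch below is illustrative: the `(read_id, position)` input, the `tolerance` and `min_reads` parameters, and the read names in the example are all hypothetical, and the alignment step itself is assumed to have been done already.

```python
def consistent_breakpoints(breakpoints, tolerance=5, min_reads=2):
    """Cluster alignment breakpoints that fall close together on the consensus.

    breakpoints: list of (read_id, consensus_position) pairs.
    Returns the clusters supported by at least min_reads distinct reads.
    """
    clusters = []
    for read_id, pos in sorted(breakpoints, key=lambda b: b[1]):
        # Join the current cluster if this breakpoint is near its last member
        if clusters and pos - clusters[-1][-1][1] <= tolerance:
            clusters[-1].append((read_id, pos))
        else:
            clusters.append([(read_id, pos)])
    return [c for c in clusters if len({r for r, _ in c}) >= min_reads]

# Three hypothetical reads break near position 1000; one breaks alone at 5000
hits = [("r1", 1000), ("r2", 1002), ("r3", 5000), ("r4", 1004)]
print(consistent_breakpoints(hits))
```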
23Performance
Combining signatures into suspicious regions
greatly improves specificity.
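One simple way to combine signatures is to merge overlapping features and keep only regions supported by more than one kind of evidence. This is a sketch of that idea under stated assumptions, not the rule amosvalidate applies: the `(start, end, type)` tuples, the `min_types` threshold, and the `max_gap` parameter are hypothetical.

```python
def suspicious_regions(features, min_types=2, max_gap=0):
    """Merge overlapping features and keep regions with several signature types.

    features: list of (start, end, signature_type) tuples on one contig.
    """
    regions = []
    for start, end, kind in sorted(features):
        if regions and start <= regions[-1]["end"] + max_gap:
            # Overlaps (or nearly touches) the open region: extend it
            regions[-1]["end"] = max(regions[-1]["end"], end)
            regions[-1]["types"].add(kind)
        else:
            regions.append({"start": start, "end": end, "types": {kind}})
    return [r for r in regions if len(r["types"]) >= min_types]

features = [
    (100, 200, "coverage"),
    (150, 300, "mates"),
    (1000, 1100, "snps"),  # isolated signature, dropped by min_types=2
]
print(suspicious_regions(features))
```

Requiring agreement between independent signatures trades some sensitivity for specificity, since a single statistical fluctuation rarely triggers two unrelated tests at the same locus.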
24Hawkeye
25Hawkeye Goals
- Interactively explore and analyze
- Libraries
- Insert Sizes, Read Length, Inserts
- Scaffolds & Contigs
- Sizes, Composition, Sequence
- Multiple Alignment, SNP Barcode
- Read Coverage, k-mer Coverage
- Inserts
- Happiness, Coverage, CE Statistic
- Reads
- Clear Range, Quality Values, Chromatograms
- Features
- Arbitrary regions of interest
- Including Mis-assembly Signatures!!!
Overview, Zoom and Filter, Details on Demand
26Launch Pad
27Histograms & Statistics
Insert Size
Read Length
GC Content
Overall Statistics
- Bird's-eye view of data and assembly quality
28Scaffold View
- Statistical Plots
- Scaffold
- Features
- Inserts
- Overview
- Control Panel
- Details
29Insert Happiness
- Happy
- Oriented Correctly
- |Insert Size - Library.mean| < Happy-Distance * Library.sd
- Stretched
- Oriented Correctly
- Insert Size > Library.mean + Happy-Distance * Library.sd
- Compressed
- Oriented Correctly
- Insert Size < Library.mean - Happy-Distance * Library.sd
- Misoriented
- Same or Outies
- Linking
- Read's mate is in some other scaffold
Both mates present
Only 1 read present
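The happiness categories on this slide map onto a short decision function. The sketch below is illustrative rather than Hawkeye's own code: the function name, argument names, and the default `happy_distance` of 2.0 are assumptions, and inserts lying exactly on a boundary are treated as happy because the slide does not say how ties are handled.

```python
def classify_insert(size, same_scaffold, oriented_correctly,
                    lib_mean, lib_sd, happy_distance=2.0):
    """Classify one insert using the happiness categories listed above."""
    if not same_scaffold:
        return "linking"        # the read's mate is in some other scaffold
    if not oriented_correctly:
        return "misoriented"    # same-direction or outie mates
    limit = happy_distance * lib_sd
    if size > lib_mean + limit:
        return "stretched"
    if size < lib_mean - limit:
        return "compressed"
    return "happy"              # within happy_distance standard deviations

# Library with mean 4000 bp and standard deviation 400 bp
for size in (4000, 5000, 3000):
    print(size, classify_insert(size, True, True, 4000, 400))
```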
30Standard Feature Types
- B: Breakpoint
- Alignment ends at this position
- C: Coverage
- Location of unusual mate coverage (asmQC)
- S: SNPs
- Location of Correlated SNPs
- U: Unitig
- Used to report location of surrogate unitigs in CA assemblies
- X: Other
- All other Features
Loading Features: loadFeatures bankname featfile
Featfile format: Contigid type end5 end3 comment
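A feature file in the five-column layout named above can be generated from any analysis script. The sketch below assumes the columns are separated by single spaces, which the slide does not specify, and the helper names and example contig ID are hypothetical. The resulting file would then be passed to loadFeatures as shown on the slide.

```python
FEATURE_TYPES = {"B", "C", "S", "U", "X"}  # the standard types listed on this slide

def format_feature(contig_id, feat_type, end5, end3, comment=""):
    """Format one record as: Contigid type end5 end3 comment."""
    if feat_type not in FEATURE_TYPES:
        raise ValueError(f"unknown feature type: {feat_type}")
    # Space-separated columns are an assumption about the expected delimiter
    return f"{contig_id} {feat_type} {end5} {end3} {comment}".rstrip()

def write_featfile(path, features):
    """Write an iterable of (contig_id, type, end5, end3, comment) tuples."""
    with open(path, "w") as handle:
        for feature in features:
            handle.write(format_feature(*feature) + "\n")

print(format_feature("contig1", "C", 100, 200, "compressed mates"))
```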
31Contig View
32Contig View Expanded
Quality Values
Normalized Chromatogram
Chromatograms are loaded from specified
directories, or on demand from Trace Archive.
33Assembly Reports
Contigs
Features
Reads
Scaffolds
- Full Integration: Double click takes you there
34SNP View
SNP Sorted Reads
Polymorphism View
35SNP Barcode
SNP Sorted Reads
Colored rectangles indicate the positions and composition of the SNPs
36Scaffold View
Coverage
CE Statistic
SNP Feature
Happy
Stretched
Compressed
Misoriented
Linking
37Collapsed Repeat
Read Coverage Spike
-5.5 CE Dip
Compressed Mates Cluster
68 Correlated SNPs
38Confirmed Misassembly
Misassembly
Fixed
- Collapsed repeat
- Compressed mates (-5.5 CE Stat)
- Correlated SNPs (68 Positions within 1400bp)
- Spike in Read Coverage
39Fixing the collapsed repeat
- Select reads and mates in region of collapse.
- AMOS: findMissingMates, select-reads
- Reassemble those reads with a stricter unitigger error rate.
- AMOS: minimus
- Patch the collapsed region of the original assembly with corrected version.
- AMOS: stitchContigs
40stitchContigs
Before
After
The multiple alignment of the patch replaces the
multiple alignment between the two stitch reads.
Otherwise, reads left of the left stitch read and
right of the right stitch read are taken directly
from the original master contig in their original
multiple alignment, so the master contig and the
stitched contig are identical except near the
compression point.
41More Information
- AMOS Webpages
- http://amos.sourceforge.net/forensics
- http://amos.sourceforge.net/hawkeye
- Contact AMOS
- amos-help at lists.sourceforge.net
- Acknowledgements