Title: AMOS tools for assembly validation
1. AMOS tools for assembly validation
- Automatically scan an assembly to locate misassembly signatures for further analysis and correction
- Load Assembly Data into Bank
- Evaluate Mate Pair Libraries
- Evaluate Read Alignments
- Evaluate Read Breakpoints
- Analyze Depth of Coverage
- Identify Surrogates
- Load Misassembly Signatures into Bank
AMOS Bank
http://amos.sourceforge.net
2. Assembly QC: mate happiness
- Evaluate mate happiness across assembly
  - Happy: correct orientation and distance
- Finds regions with multiple:
  - Compressed Mates (too close together)
  - Expanded Mates (too far apart)
  - Invalid same orientation (→ →)
  - Invalid outie orientation (← →)
  - Missing Mates
    - Linking mates (mate in a different scaffold)
    - Singleton mates (mate is not in any contig)
- Regions with high C/E statistic
3. Mate happiness
- Excision: skip reads between flanking repeats
- Truth
- Misassembly: Compressed Mates, Missing Mates
4. Mate happiness
- Insertion: additional reads between flanking repeats
- Truth
- Misassembly: Expanded Mates, Missing Mates
5. Mate happiness
- Rearrangement: reordering of reads
- Truth
- Misassembly: Misoriented Mates
Note: if A and B are too far apart, mates may all be happy
6. Compression/Expansion (C/E) Statistic
- The presence of individual compressed or expanded mates is rare but expected
- Do the inserts spanning a given position differ from the rest of the library?
  - Flag large differences as potential misassemblies
  - Even if each individual mate is happy
- Compute the statistic at all positions
  - (Local Mean - Global Mean) / Scaling Factor
- Introduced by Jim Yorke's group at UMD
7. Library size variation
[Figure: insert sizes plotted on a 0 kb-6 kb axis]
8 inserts, 3 kb-6 kb; Local Mean = 4048
C/E Stat = (4048 - 4000) / (400 / √8) = 0.33
Near 0 indicates overall happiness
8. C/E statistic: Compression
[Figure: insert sizes plotted on a 0 kb-6 kb axis]
8 inserts, 3.2 kb-4.8 kb; Local Mean = 3488
C/E Stat = (3488 - 4000) / (400 / √8) = -3.62
C/E Stat ≤ -3.0 indicates Compression
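The arithmetic on the two C/E slides can be sketched in a few lines of Python. This is a minimal illustration, not the AMOS implementation: it assumes the scaling factor is the library standard deviation divided by the square root of the number of spanning inserts (as the 400 / √8 term suggests), and the function names `ce_statistic` and `ce_profile` and the (start, end) insert representation are hypothetical.

```python
import math


def ce_statistic(insert_sizes, library_mean, library_sd):
    """C/E statistic for the inserts spanning one position.

    Assumes scaling factor = library_sd / sqrt(N), i.e. the standard
    error of the mean for N spanning inserts.
    """
    n = len(insert_sizes)
    if n == 0:
        raise ValueError("no inserts span this position")
    local_mean = sum(insert_sizes) / n
    scaling_factor = library_sd / math.sqrt(n)
    return (local_mean - library_mean) / scaling_factor


def ce_profile(inserts, positions, library_mean, library_sd):
    """C/E statistic at each requested position.

    `inserts` is a list of (start, end) half-open intervals (hypothetical
    layout). Positions spanned by no insert map to None.
    """
    profile = {}
    for pos in positions:
        sizes = [end - start for start, end in inserts if start <= pos < end]
        if sizes:
            profile[pos] = ce_statistic(sizes, library_mean, library_sd)
        else:
            profile[pos] = None
    return profile


# The two slide examples: 8 inserts, library mean 4000, library sd 400.
# Only the local mean matters, so each example uses 8 identical sizes.
happy = ce_statistic([4048] * 8, 4000, 400)
compressed = ce_statistic([3488] * 8, 4000, 400)
print(f"happy region: {happy:.3f}, compressed region: {compressed:.3f}")
```

Under these assumptions the first example evaluates to roughly 0.339 (shown truncated as 0.33 on the slide) and the second to roughly -3.62, which falls below the -3.0 compression threshold.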
9. Read Alignment
- Multiple reads with the same conflicting base are unlikely
  - 1x QV 30: 1/1,000 base-calling error
  - 2x QV 30: 1/1,000,000 base-calling error
  - 3x QV 30: 1/1,000,000,000 base-calling error
- Correlated SNPs are likely to be assembly errors, usually collapsed repeats
- AMOS Tools: analyzeSNPs, clusterSNPs
  - Locate regions with a high rate of correlated SNPs
  - Parameterized thresholds
    - Multiple positions within a 100 bp sliding window
    - 2 conflicting reads
    - Cumulative QV > 40 (1/10,000 base-calling error)
[Figure: aligned reads showing two conflicting base columns, AGC and CTA]
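The quality-value arithmetic and thresholds on this slide can be illustrated with a short Python sketch. It is not the code of analyzeSNPs or clusterSNPs: the function names, the dictionary input format, and the greedy non-overlapping windowing (a simplification of a true sliding window) are all assumptions made for illustration. Errors are treated as independent, as the slide's 1/1,000 → 1/1,000,000 progression implies.

```python
def joint_error_probability(qvs):
    """Probability that every base call is wrong, assuming independent errors.

    A Phred quality value QV corresponds to an error probability 10^(-QV/10).
    """
    p = 1.0
    for qv in qvs:
        p *= 10 ** (-qv / 10)
    return p


def correlated_snp_positions(conflicts, min_reads=2, min_cumulative_qv=40):
    """Positions where enough reads agree on the same conflicting base.

    `conflicts` maps a consensus position to the QVs of the reads sharing
    one conflicting base there (hypothetical input format). A position
    passes with at least `min_reads` reads and cumulative QV above the
    threshold.
    """
    return sorted(
        pos
        for pos, qvs in conflicts.items()
        if len(qvs) >= min_reads and sum(qvs) > min_cumulative_qv
    )


def cluster_snps(positions, window=100, min_positions=2):
    """Greedy grouping of SNP positions into windows shorter than `window` bp.

    Simplification: windows do not overlap, so this can miss groupings
    that a true sliding window would report.
    """
    clusters, current = [], []
    for pos in sorted(positions):
        if current and pos - current[0] >= window:
            if len(current) >= min_positions:
                clusters.append((current[0], current[-1]))
            current = []
        current.append(pos)
    if len(current) >= min_positions:
        clusters.append((current[0], current[-1]))
    return clusters


conflicts = {10: [30, 30], 50: [30], 60: [20, 15], 70: [25, 25], 500: [30, 30]}
snps = correlated_snp_positions(conflicts)
print(snps, cluster_snps(snps))
```

In this toy input, position 50 fails the read-count threshold and position 60 fails the cumulative-QV threshold, so only positions 10 and 70 form a cluster within one 100 bp window.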
10. Read breakpoints: compression error
Ribosomal RNA repeats, B. anthracis
- QC METHOD
  - Align singleton reads to consensus assembly
  - Find any breakpoints shared by multiple reads
[Figure labels: chimeric reads, mates]
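The second step of the QC method, finding breakpoints shared by multiple reads, might look like the sketch below. It assumes the singleton reads have already been aligned and that each partial alignment has been reduced to a (read_id, consensus_position) pair marking where the alignment stops; that input format, the function name, and the default of two supporting reads are illustrative assumptions rather than AMOS behavior.

```python
from collections import Counter


def shared_breakpoints(breakpoints, min_reads=2):
    """Consensus positions where at least `min_reads` distinct reads break.

    `breakpoints` is an iterable of (read_id, consensus_position) pairs.
    Duplicate reports from the same read at the same position count once.
    """
    counts = Counter(pos for _read_id, pos in set(breakpoints))
    return {pos: n for pos, n in sorted(counts.items()) if n >= min_reads}


observed = [("r1", 1200), ("r2", 1200), ("r3", 4500), ("r1", 1200)]
print(shared_breakpoints(observed))
```

A breakpoint seen in only one read (position 4500 here) is more likely a chimeric read than a misassembly, which is why a minimum support threshold is applied.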
11. Uncompress by creating new repeat copy
Reference B. anthracis Ames ancestor strain
B. anthracis Ames Porton Down strain
Tandem duplication
12. Read Coverage
- Find regions of contigs where the depth of coverage is unusually high
- AMOS Tool: analyzeReadDepth
  - 2.5x mean coverage
[Figure labels: A, B, R1, R2]
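A depth-of-coverage scan with the 2.5x mean threshold can be sketched as follows. This is an illustration under stated assumptions, not the analyzeReadDepth implementation: reads are given as half-open (start, end) intervals on a single contig, the mean is taken over that contig, and the function names are hypothetical.

```python
def depth_profile(reads, contig_length):
    """Per-base read depth from (start, end) half-open intervals.

    Uses a difference array so the cost is linear in reads plus bases.
    """
    diff = [0] * (contig_length + 1)
    for start, end in reads:
        # Clamp coordinates so reads overhanging the contig are tolerated.
        s = min(max(start, 0), contig_length)
        e = min(max(end, 0), contig_length)
        diff[s] += 1
        diff[e] -= 1
    depth, running = [], 0
    for delta in diff[:contig_length]:
        running += delta
        depth.append(running)
    return depth


def high_coverage_regions(depth, factor=2.5):
    """Half-open regions whose depth exceeds `factor` times the mean depth."""
    if not depth:
        return []
    threshold = factor * sum(depth) / len(depth)
    regions, start = [], None
    for i, d in enumerate(depth):
        if d > threshold and start is None:
            start = i
        elif d <= threshold and start is not None:
            regions.append((start, i))
            start = None
    if start is not None:
        regions.append((start, len(depth)))
    return regions


# One read spanning the contig plus a pile-up of five reads at 40-50.
reads = [(0, 100)] + [(40, 50)] * 5
print(high_coverage_regions(depth_profile(reads, 100)))
```

In a collapsed repeat, reads from every repeat copy pile onto the single assembled copy, so the flagged region marks a candidate for the compression misassemblies described earlier.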
13. Hawkeye: assembly viewer and debugger
14. Launch Pad
15. Histograms and Statistics
Insert Size
Read Length
GC Content
Overall Statistics
- Bird's-eye view of data and assembly quality
16. Scaffold View
- Statistical Plots
- Scaffold
- Features
- Clone inserts
- Overview
- Control Panel
- Details
17. Standard Feature Types
- B: Breakpoint
  - Alignment ends at this position
- C: Coverage
  - Location of unusual mate coverage (asmQC)
- S: SNPs
  - Location of Correlated SNPs
- U: Unitig
  - Used to report location of surrogate unitigs in CA assemblies
- X: Other
  - All other Features
18. Insert (mate) Happiness
- Happy
  - Oriented Correctly
  - |Insert Size - Library.mean| < Happy-Distance * Library.sd
- Stretched
  - Oriented Correctly
  - Insert Size > Library.mean + Happy-Distance * Library.sd
- Compressed
  - Oriented Correctly
  - Insert Size < Library.mean - Happy-Distance * Library.sd
- Misoriented
  - Same or Outies
- Linking
  - Read's mate is in some other scaffold
Both mates present
Only 1 read present
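The happiness categories on this slide amount to a small decision procedure, sketched below. The function name, the orientation labels ("innie", "same", "outie"), the `same_scaffold` flag, and the default Happy-Distance of 2.0 standard deviations are assumptions for illustration; Hawkeye's actual defaults and data model may differ. An insert exactly at the boundary is treated as happy here, a case the slide does not specify.

```python
def classify_insert(orientation, insert_size, library_mean, library_sd,
                    happy_distance=2.0, same_scaffold=True):
    """Classify one mate-pair insert into a happiness category.

    orientation: "innie" for correctly oriented mates, "same" or "outie"
    otherwise (hypothetical labels).
    """
    if not same_scaffold:
        return "linking"          # mate placed in some other scaffold
    if orientation != "innie":
        return "misoriented"      # same-direction or outie mates
    deviation = insert_size - library_mean
    limit = happy_distance * library_sd
    if deviation > limit:
        return "stretched"
    if deviation < -limit:
        return "compressed"
    return "happy"


for args in [("innie", 4100), ("innie", 5000), ("innie", 3000), ("outie", 4000)]:
    print(args, classify_insert(*args, library_mean=4000, library_sd=400))
```

Checking scaffold placement and orientation before size means an insert is only judged stretched or compressed when its size is meaningful, which matches the "Oriented Correctly" precondition listed for those categories.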
19. Contig View: detailed alignment of reads to contigs
20. SNP View
SNP Sorted Reads
Polymorphism View
21. SNP Barcode
SNP Sorted Reads
Colored rectangles indicate the positions and composition of the SNPs
22. Scaffold View
Coverage
CE Statistic
SNP Feature
Happy
Stretched
Compressed
Misoriented
Linking
23. Collapsed Repeat
Read Coverage Spike
-5.5 CE Dip
Compressed Mates Cluster
68 Correlated SNPs
24. Example 1: Compression in Prevotella intermedia 17 assembly, found by the CE statistic
- Green inserts are within 2 standard deviations of the mean, and the orange inserts are compressed by more than 2 standard deviations.
- Vertical yellow line shows the most likely place of a compression misassembly.
- Only one insert in this case is compressed by more than 3 standard deviations.
25. Example 2: Compression in Prevotella intermedia 17 assembly, found by the CE statistic
26. Fixing collapsed repeats with AMOS
Original Contig
Compression Point
Before
Patch Contig
Resolved Stitched Contig
After
27. Assemblies can be preserved at NCBI's Assembly Archive
http://www.ncbi.nlm.nih.gov/Traces/assembly/assmbrowser.cgi