AMOS Assembly Validation and Visualization

1
AMOS Assembly Validation and Visualization
  • Michael Schatz
  • October 7, 2008

2
Lander-Waterman Statistics
  • Sequencing as a random process
  • E(contigs) = $N e^{-c\sigma}$
  • E(contig size) = $L\left(\frac{e^{c\sigma} - 1}{c} + (1 - \sigma)\right)$
  • L = read length
  • T = minimum overlap
  • G = genome size
  • N = number of reads
  • c = coverage (NL / G)
  • $\sigma = 1 - T/L$
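
As a rough illustration of the Lander-Waterman expectations, here is a minimal sketch that evaluates both formulas for given G, L, N, and T. The function name and the example genome and read parameters are illustrative assumptions, not values from the talk.

```python
import math

def lander_waterman(G, L, N, T):
    """Return (coverage, expected contigs, expected contig size in bp).

    G = genome size, L = read length, N = number of reads,
    T = minimum detectable overlap.
    """
    c = N * L / G                      # coverage
    sigma = 1 - T / L                  # sigma = 1 - T/L
    expected_contigs = N * math.exp(-c * sigma)
    expected_size = L * ((math.exp(c * sigma) - 1) / c + (1 - sigma))
    return c, expected_contigs, expected_size

# Hypothetical project: 1 Mbp genome, 800 bp reads, 10,000 reads, 40 bp overlap.
c, n_contigs, size = lander_waterman(G=1_000_000, L=800, N=10_000, T=40)
print(f"coverage={c:.1f}x contigs={n_contigs:.1f} contig size={size:,.0f} bp")
```

These are idealized expectations for random sequencing of a repeat-free genome, which is why real assemblies fall short of them.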

3
Assembly Reality
  • Contigs are never as large as expected
  • High coverage is a necessary but not sufficient
    condition
  • Sequencing is basically random, but sequence
    composition is not
  • Repeats control the quality of the assembly
  • Assemblers break contigs at ambiguous repeats
  • Highly repetitive genomes will be highly
    fragmented
  • Assemblers make mistakes
  • Mis-assemblies confuse all downstream analysis
  • Tension between overlap error rate and repeat
    resolution

4
Assembly Evaluation
5
Assembly Evaluation
6
Genome Assembly Forensics
  • Computationally scan an assembly for
    mis-assemblies.
  • Data inconsistencies are indicators for
    mis-assembly
  • Some inconsistencies are merely statistical
    variations
  • amosvalidate
  • Load Assembly Data into Bank
  • Analyze Mate Pairs Libraries
  • Analyze Depth of Coverage
  • Analyze Normalized K-mers
  • Analyze Read Alignments
  • Analyze Read Breakpoints
  • Load Mis-assembly Signatures into Bank

AMOS Bank
7
Mis-assembly types
Correct
Mis-assembly
Basic mis-assemblies can be combined into more
complicated patterns: Insertions, Deletions,
Giant Hairballs
8
1. Analyze Mate Pairs
  • Evaluate mate happiness across assembly
  • Happy = Correct orientation and distance
  • Finds regions with multiple:
  • Compressed Mates
  • Expanded Mates
  • Invalid orientation (same → → or outie ← →)
  • Missing Mates
  • Linking mates (mate in a different scaffold)
  • Singleton mates (mate is not in any contig)

9
Mate-Happiness Deletion
  • Deletion: Excise reads between collapsed repeats
  • Truth
  • Misassembly: Compressed Mates, Linking Mates

10
Mate-Happiness Insertion
  • Insertion: Additional reads between flanking
    repeats
  • Truth
  • Misassembly: Expanded Mates, Missing Mates

11
Mate-Happiness Inversion
  • Inversion: Flipping of reads
  • Truth
  • Misassembly: Misoriented Mates

12
Mate-Happiness Rearrangement
  • Rearrangement: Reordering of reads
  • Truth
  • Misassembly: Misoriented, Compressed, Stretched
    Mates

13
C/E Statistic
  • Easy to detect mis-oriented or missing mates, but
    individual compressed or expanded mates are
    expected.
  • Does the distribution of inserts spanning a given
    position differ from the rest of the library?
  • Record large differences as potential
    misassemblies
  • Even if each individual mate is happy
  • Compute the statistic at all positions
  • (Local Mean - Global Mean) / (Global Stdev / √N)
  • > +3 indicates significant expansion
  • < -3 indicates significant compression
  • Introduced by Jim Yorke's group at UMD
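
The statistic above can be sketched in a few lines. This is a minimal illustration, not the AMOS implementation: the function names are hypothetical, and the library mean of 4000 and standard deviation of 400 are the values used in the worked expansion and compression examples on the following slides.

```python
import math

def ce_statistic(local_insert_sizes, global_mean, global_sd):
    """C/E statistic for the inserts spanning one position."""
    n = len(local_insert_sizes)
    if n == 0:
        raise ValueError("no inserts span this position")
    local_mean = sum(local_insert_sizes) / n
    return (local_mean - global_mean) / (global_sd / math.sqrt(n))

def interpret(ce, threshold=3.0):
    """Label a C/E value using the +/-3 cutoffs from the slide."""
    if ce > threshold:
        return "expansion"
    if ce < -threshold:
        return "compression"
    return "consistent with library"

# Eight inserts whose local mean is 4461 (sizes are made up to hit that mean).
inserts = [4461] * 8
ce = ce_statistic(inserts, global_mean=4000, global_sd=400)
print(round(ce, 2), interpret(ce))
```

Because the denominator shrinks with √N, a modest shift in the local mean becomes significant once enough inserts span the position, even if each insert is individually happy.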

14
Sampling the Genome
15
C/E-Statistic Expansion
[Figure: inserts plotted along a 0kb to 6kb axis]
8 inserts, 3.2kb-6kb; Local Mean = 4461
C/E Stat = (4461 - 4000) / (400 / √8) = +3.26
C/E Stat > 3.0 indicates Expansion
16
C/E-Statistic Compression
[Figure: inserts plotted along a 0kb to 6kb axis]
8 inserts, 3.2kb-4.8kb; Local Mean = 3488
C/E Stat = (3488 - 4000) / (400 / √8) = -3.62
C/E Stat < -3.0 indicates Compression
17
2. Read Coverage
  • Find regions of contigs where the depth of
    coverage is unusually high
  • Collapsed Repeat Signature
  • Can detect collapse of 100% identical repeats
  • AMOS Tool: analyzeReadDepth
  • 2.5x mean coverage
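
A minimal sketch of the depth-of-coverage check, assuming reads are given as half-open (start, end) intervals on a contig and that the 2.5x figure is used as a cutoff relative to the mean depth. This is an illustration only, not the analyzeReadDepth implementation; the function names are hypothetical.

```python
def depth_profile(contig_len, reads):
    """Per-base read depth from half-open (start, end) read intervals."""
    diff = [0] * (contig_len + 1)
    for start, end in reads:
        diff[start] += 1
        diff[end] -= 1
    depth, current = [], 0
    for i in range(contig_len):
        current += diff[i]
        depth.append(current)
    return depth

def high_coverage_regions(depth, factor=2.5):
    """Half-open regions where depth exceeds factor * mean depth."""
    cutoff = factor * sum(depth) / len(depth)
    regions, start = [], None
    for i, d in enumerate(depth):
        if d > cutoff and start is None:
            start = i
        elif d <= cutoff and start is not None:
            regions.append((start, i))
            start = None
    if start is not None:
        regions.append((start, len(depth)))
    return regions

# Toy contig: 2x background coverage with a pile-up of 8 extra reads at 40-50.
reads = [(0, 100), (0, 100)] + [(40, 50)] * 8
print(high_coverage_regions(depth_profile(100, reads)))
```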

18
3. Normalized K-mers
  • Not all repeats are mis-assembled, but most
    mis-assemblies occur at repeats.
  • Detect repeats by noticing increased read k-mer
    coverage.
  • 2 copy repeat will have 2x the average read
    k-mer coverage
  • N copy repeat will have Nx the average read
    k-mer coverage
  • Normalize read k-mer coverage with consensus
    k-mer coverage to highlight just the repeats with
    missing copies.
  • The sequence of an N copy repeat should occur N
    times in the consensus.
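
The normalization can be sketched as a ratio of k-mer counts. The version below is a simplified illustration under stated assumptions: it ignores reverse complements and sequencing errors, and the function names are hypothetical rather than part of AMOS.

```python
from collections import Counter

def kmer_counts(sequences, k):
    """Count every k-mer occurrence across a collection of sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts

def normalized_kmer_coverage(reads, consensus, k):
    """For each consensus position, read k-mer count / consensus k-mer count.

    A correctly assembled N-copy repeat occurs N times in the consensus, so
    its ratio matches unique sequence; a collapsed repeat stands out as high.
    """
    read_counts = kmer_counts(reads, k)
    cons_counts = kmer_counts([consensus], k)
    profile = []
    for i in range(len(consensus) - k + 1):
        kmer = consensus[i:i + k]
        profile.append(read_counts[kmer] / cons_counts[kmer])
    return profile

# Toy data: the first k-mer is seen twice as often in reads as the last one.
toy_reads = ["AAAC"] * 4 + ["CCGT"] * 2
print(normalized_kmer_coverage(toy_reads, "AAACCGT", k=4))
```

Since the read k-mers are counted from all reads, the profile does not depend on where (or whether) a read was placed, which is why the signal survives when reads are left as singletons.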

19
Normalized K-mers
  • Correct assembly of 2 copy repeat

20
Normalized K-mers
  • Mis-assembly of 2 copy repeat
  • Effective even if reads are left as singletons

21
4. Read Alignment
  • Multiple reads with same conflicting base are
    unlikely
  • 1x QV 30: 1/1,000 base calling error
  • 2x QV 30: 1/1,000,000 base calling error
  • 3x QV 30: 1/1,000,000,000 base calling error
  • Regions of correlated SNPs are likely to be
    assembly errors or interesting biological events
  • Highly specific metric
  • AMOS Tools: analyzeSNPs, clusterSNPs
  • Locate regions with high rate of correlated SNPs
  • Parameterized thresholds
  • Multiple positions within 100bp sliding window
  • 2 conflicting reads
  • Cumulative QV > 40 (1/10,000 base calling error)
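
The quality-value arithmetic above follows from the Phred definition, where the error probability is 10^(-QV/10), so independent errors multiply and their QVs add. A minimal sketch, assuming the thresholds are "at least 2 conflicting reads" and "cumulative QV above 40"; whether the real cutoffs are inclusive is an assumption, and the function names are hypothetical.

```python
import math

def error_probability(qv):
    """Phred quality value -> probability that the base call is wrong."""
    return 10 ** (-qv / 10)

def cumulative_qv(qvs):
    """QV of several independent reads all being wrong at the same base."""
    return sum(qvs)

def is_supported_conflict(conflicting_qvs, min_reads=2, min_cumulative_qv=40):
    """True if enough reads, with enough combined quality, share the conflict."""
    return (len(conflicting_qvs) >= min_reads
            and cumulative_qv(conflicting_qvs) > min_cumulative_qv)

for n in (1, 2, 3):
    print(n, "x QV 30 ->", error_probability(cumulative_qv([30] * n)))
print(is_supported_conflict([30, 30]))
```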

[Figure: read multiple alignment in which one group of reads shows A G C and another shows C T A at the same positions]
22
5. Read Breakpoints
  • Align singleton reads to consensus sequences.
  • A consistent breakpoint shared by multiple reads
    can indicate a collapsed repeat.
  • Initially developed to detect a collapsed repeat in
    Bacillus anthracis.
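
One way to look for a consistent breakpoint is to group the consensus positions where singleton-read alignments end and keep the groups supported by several reads. The sketch below assumes breakpoints have already been extracted as (read id, consensus position) pairs; the tolerance, the minimum read count, the read names, and the function name are all illustrative assumptions.

```python
def shared_breakpoints(breakpoints, tolerance=5, min_reads=2):
    """Group breakpoints that lie within `tolerance` bp of their neighbor.

    breakpoints: iterable of (read_id, consensus_position).
    Returns (first_pos, last_pos, [read_ids]) for each group with at least
    `min_reads` reads.
    """
    ordered = sorted(breakpoints, key=lambda bp: bp[1])
    clusters, current = [], []
    for read_id, pos in ordered:
        # Start a new group when the gap to the previous breakpoint is too big.
        if current and pos - current[-1][1] > tolerance:
            clusters.append(current)
            current = []
        current.append((read_id, pos))
    if current:
        clusters.append(current)
    return [
        (group[0][1], group[-1][1], [read_id for read_id, _ in group])
        for group in clusters
        if len(group) >= min_reads
    ]

# Hypothetical breakpoints: three reads agree near position 1000.
observed = [("r1", 1000), ("r2", 1002), ("r3", 5000), ("r4", 1003)]
print(shared_breakpoints(observed))
```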

[Figure: alignments of singleton reads BAPDN53TF (786bp), BAPDF83TF (786bp), BAPCM37TR (697bp), and BAPBW17TR (1049bp) against the consensus around the 16S rRNA; regions labeled RA and RB]
23
Performance
Combining signatures into suspicious regions
greatly improves specificity.
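
A minimal sketch of how signatures might be combined: merge overlapping signature intervals and keep only the merged regions supported by more than one kind of evidence. This is an assumption about the general idea, not the amosvalidate algorithm; the type letters are borrowed from the standard feature types (B, C, S), and the function name and coordinates are made up.

```python
def suspicious_regions(features, min_types=2, max_gap=0):
    """Merge overlapping features and keep regions with several signature types.

    features: iterable of (start, end, type) on one contig.
    Returns (start, end, [sorted types]) for each qualifying merged region.
    """
    regions = []
    for start, end, ftype in sorted(features):
        if regions and start <= regions[-1][1] + max_gap:
            # Overlaps (or is within max_gap of) the previous region: extend it.
            regions[-1][1] = max(regions[-1][1], end)
            regions[-1][2].add(ftype)
        else:
            regions.append([start, end, {ftype}])
    return [
        (start, end, sorted(types))
        for start, end, types in regions
        if len(types) >= min_types
    ]

# A coverage signature and a SNP signature overlap; the breakpoint is isolated.
signatures = [(100, 200, "C"), (150, 300, "S"), (1000, 1100, "B")]
print(suspicious_regions(signatures))
```

Requiring agreement between independent signatures discards many regions that are flagged by a single statistic through ordinary variation.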
24
Hawkeye
25
Hawkeye Goals
  • Interactively explore and analyze
  • Libraries
  • Insert Sizes, Read Length, Inserts
  • Scaffolds & Contigs
  • Sizes, Composition, Sequence
  • Multiple Alignment, SNP Barcode
  • Read Coverage, k-mer Coverage
  • Inserts
  • Happiness, Coverage, CE Statistic
  • Reads
  • Clear Range, Quality Values, Chromatograms
  • Features
  • Arbitrary regions of interest
  • Including Mis-assembly Signatures!!!

Overview, Zoom & Filter, Details on Demand
26
Launch Pad
27
Histograms & Statistics
Insert Size
Read Length
GC Content
Overall Statistics
  • Bird's-eye view of data and assembly quality

28
Scaffold View
  • Statistical Plots
  • Scaffold
  • Features
  • Inserts
  • Overview
  • Control Panel
  • Details

29
Insert Happiness
  • Happy
  • Oriented Correctly
  • |Insert Size - Library.mean| <= Happy-Distance *
    Library.sd
  • Stretched
  • Oriented Correctly
  • Insert Size > Library.mean + Happy-Distance *
    Library.sd
  • Compressed
  • Oriented Correctly
  • Insert Size < Library.mean - Happy-Distance *
    Library.sd
  • Misoriented
  • Same or Outies
  • Linking
  • Read's mate is in some other scaffold

Both mates present
Only 1 read present
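
The happiness categories can be written as a small decision function. This is a sketch under assumptions: the default Happy-Distance of 2 standard deviations, the order in which the checks are applied, and the function and parameter names are illustrative, not taken from Hawkeye.

```python
def classify_insert(size, oriented_correctly, same_scaffold,
                    lib_mean, lib_sd, happy_distance=2.0):
    """Classify one insert whose two mates are both placed."""
    if not same_scaffold:
        return "linking"          # mate is in some other scaffold
    if not oriented_correctly:
        return "misoriented"      # same orientation or outies
    limit = happy_distance * lib_sd
    deviation = size - lib_mean
    if deviation > limit:
        return "stretched"
    if deviation < -limit:
        return "compressed"
    return "happy"

# Library with mean 4000 and sd 400; happy range is 3200-4800 by default.
for size in (4000, 5000, 3000):
    print(size, classify_insert(size, True, True, lib_mean=4000, lib_sd=400))
```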
30
Standard Feature Types
  • B: Breakpoint
  • Alignment ends at this position
  • C: Coverage
  • Location of unusual mate coverage (asmQC)
  • S: SNPs
  • Location of Correlated SNPs
  • U: Unitig
  • Used to report location of surrogate unitigs in
    CA assemblies
  • X: Other
  • All other Features

Loading Features: loadFeatures bankname featfile
Featfile format: Contigid type end5 end3 comment
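
A minimal sketch of reading one featfile line, assuming the five fields are whitespace-separated and that the comment is free text running to the end of the line. The `Feature` type, the function name, and the sample line are hypothetical.

```python
from typing import NamedTuple

class Feature(NamedTuple):
    contig_id: str
    type: str       # one of the standard type letters, e.g. B, C, S, U, X
    end5: int
    end3: int
    comment: str

def parse_feature_line(line):
    """Parse 'Contigid type end5 end3 comment' into a Feature."""
    fields = line.strip().split(None, 4)
    if len(fields) < 4:
        raise ValueError(f"expected at least 4 fields, got {len(fields)}")
    contig_id, ftype, end5, end3 = fields[:4]
    comment = fields[4] if len(fields) == 5 else ""
    return Feature(contig_id, ftype, int(end5), int(end3), comment)

print(parse_feature_line("contig12 C 1500 2300 compressed mates"))
```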
31
Contig View
32
Contig View Expanded
Quality Values
Normalized Chromatogram
Chromatograms are loaded from specified
directories, or on demand from Trace Archive.
33
Assembly Reports
Contigs
Features
Reads
Scaffolds
  • Full Integration: double click takes you there

34
SNP View
SNP Sorted Reads
Polymorphism View
35
SNP Barcode
SNP Sorted Reads
Colored rectangles indicate the positions and
composition of the SNPs
36
Scaffold View
Coverage
CE Statistic
SNP Feature
Happy
Stretched
Compressed
Misoriented
Linking
37
Collapsed Repeat
Read Coverage Spike
-5.5 CE Dip
Compressed Mates Cluster
68 Correlated SNPs
38
Confirmed Misassembly
Misassembly
Fixed
  • Collapsed repeat
  • Compressed mates (-5.5 CE Stat)
  • Correlated SNPs (68 Positions within 1400bp)
  • Spike in Read Coverage

39
Fixing the collapsed repeat
  • Select reads and mates in region of collapse.
  • AMOS: findMissingMates, select-reads
  • Reassemble those reads with a stricter unitigger
    error rate.
  • AMOS: minimus
  • Patch the collapsed region of the original
    assembly with the corrected version.
  • AMOS: stitchContigs

40
stitchContigs
Before
After
The multiple alignment of the patch replaces the
multiple alignment between the two stitch reads.
Otherwise, reads left of the left stitch read and
right of the right stitch read are taken directly
from the original master contig in their original
multiple alignment, so the master contig and the
stitched contig are identical except near the
compression point.
41
More Information
  • AMOS Webpages
  • http://amos.sourceforge.net/forensics
  • http://amos.sourceforge.net/hawkeye
  • Contact AMOS
  • amos-help at lists.sourceforge.net
  • Acknowledgements