Title: Novel multi-platform next generation assembly methods for mammalian genomes
1 Novel multi-platform next generation assembly
methods for mammalian genomes
- The Baylor College of Medicine, the Australian
Government, and the University of Connecticut teamed
up (as the KanGO consortium) as part of an
international effort to generate a de novo genome
sequence for a marsupial model species, the
tammar wallaby (M. eugenii).
- The tammar wallaby is an important model organism:
it is a metatherian mammal and harbors unique life
history traits and genome features.
- Genome sequencing was done at several
institutions, using several technologies, over
several years.
- We developed a pipeline to integrate long and short
read sequences and existing assemblies using
well-known mapping and assembly tools such as Bowtie
and Phrap.
2 Overview of the Data
Technology  Institution  Read Length  Reads        Bases
Sanger      BCM          Avg. 915 bp  9,924,136    9,088,748,105
454         AGRF, UCONN  Avg. 160 bp  1,530,592    275,951,386
Illumina    AGRF         100 bp       271,875,064  27,187,506,400
SOLiD       BCM          25 bp        710,427,490  18,471,114,740
Paired Read Overview
[Figure: a read pair, two reads separated by an insert]
- All reads used in the reassembly were paired.
- Read orientation was not considered.
- The SOLiD reads had an insert size of 1.395 kb.
- The Illumina reads had insert sizes of 3 and 8 kb
in roughly equal proportions.
- The 454 reads had insert sizes of 8, 12, 20, and
30 kb, with the majority split between 20
and 30 kb.
3 Local Assembly Pipeline
Initial Assembly: Sanger reads were assembled using
Atlas (BCM). Initial scaffolding was done using SOLiD
mate pairs.
Map Short Reads: short reads are mapped to the
scaffolds using Bowtie. There are two cases:
- Orphaned, when one mate is unmapped.
- Complete, when both mates are mapped.
Reads which map to multiple locations (non-unique)
are not considered.
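The two cases above can be sketched in a few lines of Python. This is only an illustration, not the pipeline's code: the function name `classify_pair`, the hit-list representation, and the "unused" label are hypothetical, and it assumes that a non-unique read is simply treated as having no usable placement.

```python
from collections import Counter


def classify_pair(hits_1, hits_2):
    """Classify a read pair from the placements of its two mates.

    hits_1, hits_2: lists of (scaffold, position) placements reported
    by the mapper for each mate (hypothetical representation).
    """
    # Assumption: a read with several placements is non-unique and is
    # dropped, so only a read with exactly one placement is usable.
    usable_1 = len(hits_1) == 1
    usable_2 = len(hits_2) == 1
    if usable_1 and usable_2:
        return "complete"
    if usable_1 or usable_2:
        return "orphaned"
    return "unused"


# Toy input: pair name -> (placements of mate 1, placements of mate 2).
pairs = {
    "p1": ([("scf1", 100)], [("scf1", 3100)]),
    "p2": ([("scf2", 50)], []),
    "p3": ([("scf1", 10), ("scf3", 700)], [("scf1", 900)]),
    "p4": ([], []),
}
counts = Counter(classify_pair(m1, m2) for m1, m2 in pairs.values())
print(counts)
```

Under this assumption pair p3 counts as orphaned, because its first mate is non-unique and only the second mate contributes a placement.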
[Figure key: 454, Illumina, Sanger, unmapped read;
dotted line: estimated distance]
4 Pipeline continued
Scaffold Assembly: feed the contigs, together with the
complete and orphaned pairs, to Phrap and re-assemble.
Final Mapping: map all data to the final contigs.
Gap Re-estimation: re-estimate all gap distances in
scaffolds using complete pairs whose mates map to
different contigs.
Quality Calling: map all reads and calculate a
quality score for each base.
Output: contigs, quality scores, and scaffolds (AGP
file).
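As an illustration of the scaffold output, the sketch below writes one scaffold as tab-separated AGP 2.0 lines ("W" for a contig component, "N" for a gap of estimated size). The function `scaffold_to_agp`, its input layout, and the choice to clamp non-positive gap estimates to 1 bp are assumptions made for this example, not a description of the pipeline's own writer.

```python
def scaffold_to_agp(scaffold_id, contigs, gaps):
    """Return AGP 2.0 lines for one scaffold.

    contigs: list of (contig_id, length, orientation) in scaffold order.
    gaps:    estimated gap sizes between consecutive contigs,
             so len(gaps) == len(contigs) - 1.
    """
    lines = []
    pos = 1    # 1-based start coordinate of the next part
    part = 1   # running part number within the scaffold
    for i, (contig_id, length, orientation) in enumerate(contigs):
        end = pos + length - 1
        fields = [scaffold_id, pos, end, part, "W",
                  contig_id, 1, length, orientation]
        lines.append("\t".join(str(f) for f in fields))
        pos, part = end + 1, part + 1
        if i < len(gaps):
            # Assumption: AGP needs a positive gap length, so estimates
            # that are zero or negative are clamped to 1 bp.
            gap = max(1, int(round(gaps[i])))
            end = pos + gap - 1
            fields = [scaffold_id, pos, end, part, "N",
                      gap, "scaffold", "yes", "paired-ends"]
            lines.append("\t".join(str(f) for f in fields))
            pos, part = end + 1, part + 1
    return lines


agp = scaffold_to_agp("scf1", [("ctg1", 100, "+"), ("ctg2", 50, "-")], [20])
print("\n".join(agp))
```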
5 Local Assembly Close-up
[Figure: three panels, A, B, and C]
- Screenshot taken from CodonCode Aligner, which
uses Phrap to map reads.
- Red and blue denote orientation.
- A: Two contigs may be fused using short reads.
- B: A contig may be extended with short reads at
one end.
- C: Single-nucleotide and other small errors are
corrected using the higher coverage of the short reads.
- We are confident of these changes because, even at
<10x coverage, each read is either uniquely mapped
itself or has its approximate position supported by
a uniquely mapped read.
6 Assembly Comparison and Validation
Category        Meug 1.0  Meug 1.1  Meug 1.2
Contigs (10^6)  1.21      1.17      1.101
N50 (10^3)      2.5       2.6       2.91
Bases (10^6)    2546      2536      2574
Scaffolds       616418    277711    277711
Gaps (10^6)     NA        539       614
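For readers unfamiliar with the N50 row, here is a minimal sketch of the usual definition: the length of the contig at which the running total, taken from longest to shortest, first reaches half of all assembled bases. The function name `n50` and the toy lengths are illustrative only.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    if not lengths:
        raise ValueError("need at least one contig length")
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first contig that brings the running total to at least
        # half of the assembly defines the N50.
        if running >= half:
            return length


toy_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(n50(toy_lengths))
```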
RIKEN BACs      Total Reads  Total Bases  Recovered Reads  Recovered Bases
1.1 (original)
  BAC           147312       113232882    34734            25662069
  FOSMID        31250        18777544     4335             2263696
1.2 (updated)
  BAC           147312       113232882    33328            24555624
  FOSMID        31250        18777544     4294             2241059
7 Gap Estimation Methods
- The gap between two contigs is estimated using an
expectation-maximization (EM) algorithm. The two
steps below are repeated until the estimated
parameters no longer change.
- Step 1, Maximization: compute the gap estimate (x),
letting the mean insert length of the N pairs equal µ
(the initial value is the library average).
- Step 2, Sampling: given x and the lengths of the
contigs, sample µ from completely mapped pairs
spanning the gap.
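The iteration can be sketched as follows. This is one possible reading of the two steps, not the authors' implementation. It assumes a normally distributed library, ignores read length, and replaces the sampling in Step 2 with an analytic expectation: the mean insert size among pairs that can start in the left contig and end in the right one. All names (`expected_spanning_insert`, `em_gap`, `in_contig_spans`) are hypothetical.

```python
import math


def expected_spanning_insert(gap, len_a, len_b, lib_mean, lib_sd):
    """Mean insert size among pairs able to span the gap (stand-in for Step 2)."""
    lo = max(1, int(lib_mean - 6 * lib_sd))
    hi = int(lib_mean + 6 * lib_sd)
    total_w = 0.0
    total_sw = 0.0
    for s in range(lo, hi + 1):
        # Start positions in contig A for which an insert of size s
        # ends inside contig B.
        first = max(0.0, len_a + gap - s)
        last = min(float(len_a), len_a + gap + len_b - s)
        placements = max(0.0, last - first)
        w = placements * math.exp(-0.5 * ((s - lib_mean) / lib_sd) ** 2)
        total_w += w
        total_sw += s * w
    if total_w == 0.0:
        raise ValueError("no insert from this library can span the gap")
    return total_sw / total_w


def em_gap(in_contig_spans, len_a, len_b, lib_mean, lib_sd,
           tol=0.01, max_iter=100):
    """Alternate the gap estimate x and the insert mean mu until x settles.

    in_contig_spans: for each spanning pair, the part of its insert that
    lies inside the two contigs (insert size minus the gap).
    """
    mean_span = sum(in_contig_spans) / len(in_contig_spans)
    gap = lib_mean - mean_span  # Step 1 with mu set to the library average
    for _ in range(max_iter):
        mu = expected_spanning_insert(gap, len_a, len_b, lib_mean, lib_sd)
        new_gap = mu - mean_span  # Step 1 with the updated mu
        if abs(new_gap - gap) < tol:
            return new_gap
        gap = new_gap
    return gap


# Self-consistency check: build the span expected for a 1000 bp gap,
# then see whether the iteration recovers that gap.
true_gap = 1000.0
span = expected_spanning_insert(true_gap, 5000, 5000, 3000.0, 300.0) - true_gap
estimate = em_gap([span], 5000, 5000, 3000.0, 300.0)
print(round(estimate, 2))
```

The correction matters because pairs that span a gap are not a random sample of the library: longer inserts have more positions from which they can bridge the gap, so using the plain library average for µ tends to underestimate x.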
8 Gap Estimation Results
- When using libraries with different insert sizes
and standard deviations, it is necessary to bundle
the estimates.
- The following is an example of how two libraries
are bundled, where n_x is the number of reads in
library x, e_x is the library x estimate, and s_x is
the library standard deviation.
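The bundling formula itself is not given in the text above. As a hedged illustration only, one standard way to combine per-library estimates is an inverse-variance weighted mean, where each library's estimate e_x is given weight n_x / s_x^2 (its standard error being s_x / sqrt(n_x)). This may differ from the formula the authors used; the function name `bundle` is hypothetical.

```python
import math


def bundle(estimates):
    """Combine per-library gap estimates by inverse-variance weighting.

    estimates: list of (n, e, s) tuples, i.e. read count, gap estimate,
    and library standard deviation for each library.
    Returns (bundled estimate, standard error of the bundled estimate).
    """
    # Assumption: each library's estimate has variance s^2 / n.
    weights = [n / (s * s) for n, _, s in estimates]
    total = sum(weights)
    gap = sum(w * e for w, (_, e, _) in zip(weights, estimates)) / total
    std_err = math.sqrt(1.0 / total)
    return gap, std_err


# Two toy libraries: a tight 3 kb library and a looser 8 kb library.
gap, err = bundle([(40, 950.0, 300.0), (10, 1200.0, 800.0)])
print(round(gap, 1), round(err, 1))
```

With this weighting the tighter, better-sampled library dominates, so the bundled value stays close to its estimate.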
Simulation study of EM algorithm accuracy.
Rows: gap size; columns: contig length; cells: estimated gap.
gap \ ctg len  1000  2000  3000  4000  5000
10             -19   -7    2     -1    -15
100            66    87    92    88    74
200            161   188   192   188   173
500            446   492   492   487   469
800            733   795   794   784   764
1000           939   997   995   982   956
1200           1153  1196  1196  1177  1152
1500           1501  1493  1496  1469  1436
9 Conclusion
- This method is a viable way of improving existing
draft genomes with short read technologies at
limited (<10x depth) coverage.
- This method is robust and easily parallelized, so
it is practical for large mammalian genomes.
Future Work
- Better results may be obtained through multiple
iterations.
- Re-scaffolding of the contigs should be done
between iterations.
- A contig-aware assembly algorithm could improve
local assembly performance.