Title: Genome Assembly
1- Genome Assembly and Finishing
- Alla Lapidus, Ph.D.
- Associate Professor
- Fox Chase Cancer Center
2A typical Microbial (and not only) project
Sequencing
Draft assembly
Goals: completely restore the genome; produce a high quality consensus
FINISHING
Annotation
Public release
3Sequencing Technology at a Glance
4Evolution of Microbial Drafts
- Sanger only: 4x of 3 kb plasmids + 4x of 8 kb plasmids + 1x of fosmids; 50k for a 5 Mb genome draft
- Hybrid Sanger/pyrosequence/Illumina: 4x of 8 kb Sanger + 15x coverage 454 shotgun + 20x Illumina (quality improvement); 35k for a 5 Mb genome draft
- 454 + Solexa: 20x coverage 454 standard + 4x coverage 454 paired end (PE) + 50x coverage Illumina shotgun (quality improvement, gaps); 10k per 5 Mb genome
- Solexa only: low cost, but too fragmented; a good assembler is needed!
- Solexa + PacBio: low cost, better scaffolding
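The coverage figures above follow the usual relation: coverage = (number of reads x read length) / genome size. A minimal sketch of that arithmetic is given below. The 5 Mb genome size is from the slide; the read lengths (400 bp for 454, 36 bp for Illumina) are illustrative assumptions, not values from the talk.

```python
import math

def reads_needed(genome_size_bp, coverage, read_length_bp):
    """Number of reads needed to reach a target average coverage."""
    return math.ceil(genome_size_bp * coverage / read_length_bp)

def average_coverage(n_reads, read_length_bp, genome_size_bp):
    """Average coverage obtained from n_reads reads of a fixed length."""
    return n_reads * read_length_bp / genome_size_bp

GENOME = 5_000_000  # 5 Mb genome, as on the slide

# Read lengths below are assumptions chosen only for illustration.
print(reads_needed(GENOME, 20, 400))  # 20x of 454 at an assumed 400 bp per read
print(reads_needed(GENOME, 50, 36))   # 50x of Illumina at an assumed 36 bp per read
```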
5Process Overview
6Library Preparation - Sanger
DNA fragmentation
Random fragment DNA
7Library Preparation - new
8Assembly (assembler)
- Sanger reads only (phrap, PGA, Arachne)
- 454/Solexa (Newbler, PCAP, Velvet, ALLPATHS, etc.)
9Draft assembly - what we get
Assembly: a set of contigs
Ordered sets of contigs (scaffolds)
[Figure: contigs 10, 16 and 21; contigs 10 and 21 joined into a scaffold by PE]
Clone walk (Sanger lib)
New technologies: no clones to walk off, even if you can scaffold contigs (bPCR is a new approach to gap closing)
10Primer walking
Clone walk (captured gaps)
Clone A
11Why do we have gaps?
What are gaps? Genome areas not covered by random shotgun
- Sequencing coverage may not span all regions of the genome, thus producing gaps in the assembly (colony picking)
- Assembly of the shotgun reads may produce misassembled regions due to repetitive sequences (new and old tech)
- A biased base content (this can result in failure to be cloned, poor stability in the chosen host-vector system, or inability of the polymerase to reliably copy the sequence)
- AT-rich DNA clones poorly in bacteria (cloning bias, promoter-like structures; Sanger) -> uncaptured gaps
- GC-rich DNA is difficult to PCR and to sequence and often requires the use of special chemistry -> captured gaps
- High AT and GC content: gaps caused by problematic PCR (new tech)
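Why random shotgun coverage alone leaves gaps can be illustrated with the idealized Lander-Waterman model, which is a standard textbook approximation and not something taken from these slides. It assumes reads land uniformly at random and ignores the minimum overlap needed to join reads: the uncovered fraction of the genome is then about e^(-c) and the expected number of gaps is about N * e^(-c), where c is the coverage and N the number of reads. The 700 bp read length is an assumption for illustration.

```python
import math

def uncovered_fraction(coverage):
    """Expected fraction of the genome hit by no read (idealized model)."""
    return math.exp(-coverage)

def expected_gaps(n_reads, coverage):
    """Expected number of gaps between contigs (idealized model)."""
    return n_reads * math.exp(-coverage)

genome = 5_000_000   # 5 Mb genome
read_len = 700       # assumed Sanger read length, for illustration only

for c in (1, 4, 8, 10):
    n_reads = c * genome / read_len
    missing_bp = uncovered_fraction(c) * genome
    print(f"{c}x: ~{missing_bp:,.0f} bp uncovered, ~{expected_gaps(n_reads, c):,.1f} gaps")
```

Real projects deviate from this model because of the cloning and base-composition biases listed above, so the observed number of gaps is usually higher.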
12Assembling repeats
13High GC sequencing problems
- The presence of small hairpins (inverted repeat sequences) in the DNA that reanneal either during sequencing or electrophoresis, resulting in failed sequencing reactions or unreadable electrophoresis results. (This can be aided by adding modifiers to the reaction, sequencing smaller clones and running gels at higher temperatures in the presence of stronger denaturants.)
14Why more than one platform?
- 454: high quality, reliable skeletons of genomes (454 std + 454 PE); correctly assembled contigs; problems with repeats (unassembled or assembled in contigs outside of main scaffolds); homopolymer related frameshifts
- Illumina: data is used to help improve the overall consensus quality, correct frameshifts and close secondary structure related gaps; not ready for de novo assembly of complex genomes (too many gaps!)
- Sanger: finishing reads; fosmids; larger repeats and templates for primer walk; less cost effective but very useful in many cases
15 454 (pyrosequence) and low GC genomes: Thermotoga lettingae TMO
Sanger based draft assembly: 55 total contigs, 41 contigs >2 kb; 38% GC; biased Sanger libraries
454 draft assembly: 2 total contigs, 1 contig >2 kb; 454 involves no cloning
<166 bp>: average length of gaps
16 454 and High GC projects
Xylanimonas cellulosilytica DSM 15894 (3.8 Mb, 72.1% GC)
PGA assembly: 9x of 8 kb + 454
PGA assembly: 9x of 8 kb
Assembly      Total contigs  Major contigs  Scaffolds  Misassemblies  N50
PGA-8kb       210            166            4          165            41,048
PGA-8kb+454   33             23             2          14             288,369
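The N50 column in the table can be computed from a list of contig lengths. The sketch below uses the common definition (the length of the contig at which the running total, taken from the longest contig down, first reaches half of the assembled bases). The contig lengths in the example are made up; the per-contig lengths behind the table are not given in the slides.

```python
def n50(contig_lengths):
    """Length L such that contigs of length >= L hold at least half of all bases."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no contigs given")

# Hypothetical contig lengths, for illustration only.
print(n50([2, 2, 2, 3, 3, 4, 8, 8]))
```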
17NextGen high quality drafts at JGI (multiple sequencing platforms)
Unassembled 454 reads
Solexa contig
454/Sanger contig
Fosmid ends and 454 PE
1. Pyrosequence and Sanger data to obtain the main ordered and oriented part of the assembly (Newbler assembler)
2. GapResolution (in-house tool) to close some (up to 40) gaps using unassembled 454 data (PGA or Newbler assemblers)
3. Solexa reads to detect and correct errors in the consensus with an in-house tool (the Polisher) and to close gaps (Velvet)
Fosmid ends are not used for microbes
18Solving gaps: GapResolution tool
19Solving gaps: GapResolution tool (II)
Step 3: If the gap is not closed, the tool designs primers for sequencing reactions
Step 4: Iterate as necessary (in sub-assemblies)
http://www.jgi.doe.gov/ degilbert_at_lbl.gov
20Solexa for gaps
- Velvet assembly
- Blast Velvet contigs against Newbler ends
- Use proper Velvet contigs to close gaps
Velvet contig
454 Contig
Gap
Illumina reads
Velvet contigs close gaps caused by hairpins and
secondary structures
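The idea of using a Velvet contig to bridge two 454 contigs can be sketched as follows. This is a toy illustration only: it uses exact string matching of the contig ends instead of BLAST, ignores reverse complements and sequencing errors, and the function name `close_gap` and the `flank` parameter are hypothetical, not part of any JGI tool.

```python
def close_gap(left_contig, right_contig, bridge_contig, flank=20):
    """Return the gap sequence if bridge_contig spans both contig ends, else None.

    The last `flank` bases of the left contig and the first `flank` bases of
    the right contig must both occur, in that order, in the bridging contig.
    """
    left_flank = left_contig[-flank:]
    right_flank = right_contig[:flank]
    i = bridge_contig.find(left_flank)
    if i == -1:
        return None
    gap_start = i + len(left_flank)
    j = bridge_contig.find(right_flank, gap_start)
    if j == -1:
        return None
    return bridge_contig[gap_start:j]

# Made-up sequences, for illustration only.
left = "AAACCCGGG"
right = "TTTGGGCCC"
velvet = "CCCGGG" + "ACGAACGA" + "TTTGGG" + "AA"
print(close_gap(left, right, velvet, flank=6))
```

A contig that matches only one flank does not close the gap and returns None, which corresponds to the "use proper Velvet contigs" step above.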
21Low quality areas: areas of potential frameshifts
Assemblies contain low quality regions (red tags)
22Homopolymer related frameshifts
Frameshift 1 (AAAAA, should be AAAA)
homopolymers (n>3)
Frameshift 2 (CCCC, should be CCC)
Modified from N. Ivanova (JGI)
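Candidate positions for homopolymer related frameshifts can be listed by scanning the consensus for runs of a single base longer than three. The sketch below is a simple regular-expression scan, not the method used at JGI; the function name and the example sequence are made up.

```python
import re

def homopolymer_runs(seq, min_len=4):
    """Return (start, base, length) for each run of one base with length >= min_len."""
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_len - 1))
    return [(m.start(), m.group(1), len(m.group(0)))
            for m in pattern.finditer(seq.upper())]

# Made-up sequence containing an AAAAA run and a CCCC run.
print(homopolymer_runs("GATAAAAACGCCCCTT"))
```

Each reported run is only a place to look; whether the run length is wrong has to be decided from independent data such as Illumina reads.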
23Polisher software for consensus quality
improvement
24Errors corrected by Solexa
Frame shift detected (454 contig)
CCTCTTTGATGGAAATGATATCTTCGAGCATCGCCTCGGGTT
TTCCATACAGAGAACCTTTGATGATGAACCGGTTGAAGATCTGCGGGTCA
AA CCTCTTTGATGGAAATAATATATTCGAGCATC
TTAGTGGAAATGATATCTTCGAGCATCGCCTC
CGAGCNTCGCCTCGGGCTTTCCCT
CGAGCATCGCCTCGGGTTCTCCATA
CACAGA
GCATCGCCTCGGGTTTTCAATACAGAGAACCT
CAGCGCCTCGGGTTTTCCATACAGAGAAC
CTT
ATCGCCTCGGGTTTTCCAGACAGAGAACCTTT
GGTTCGGGTTTTCCATACAGAGAAC
CTTTGAT
GTTTTCCATACAGAGAACATTTGATGATGAAC
GTTGTCCATACAGAGAACTTTTGATGATGAAC
TATANCATACAGAGAACCT
TTGATGATGAACC
ATTTCCAGACAGAGAACCNTTGATGATGAACC
CAAACAGAGAACCTTTGAGGATGAACCGGTTG
ACAGGGAACCTTAGATGATGAACCGGTTGAAG
ACAGAGAACCTTAGATGATGAACCGGTTGAAG
ACCGTTGATGATGAACCGGTTGAAGATCTGCG
GATGGTGAACGGGTTGAAGATCTGCGGGTCAA
GGTTTGAAGATCTGCGGGTCAAACCAGTCCTC
GGTGGAAGATCTGCGGGTAAAAC
CAGTCCTCT
GGT.GNAGAGCTGCGGGTCAAACCAGTCCTCTG
TGAAGATCTGCGGTTCAAACCAGTCCTCTCCC
GATCGGCGTGTCAAACCAGTCCTCTGCCTCGT
TCTGCGGGTCAAACCAGTACTCTGCCTC
GTTC
25So, what is Finishing?
- The process of taking a rough draft assembly composed of shotgun sequencing reads, identifying and resolving misassemblies, sequence gaps and regions of low quality to produce a highly accurate finished DNA sequence.
Final quality
The final error rate should be less than 1 per 50 kb. No gaps, no misassembled areas, no characters other than ACGT.
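Two of the criteria above are easy to check mechanically. The sketch below converts the target error rate into a Phred-style quality value (Q = -10 * log10(error rate), so 1 error per 50 kb is roughly Q47) and lists positions holding any character other than A, C, G or T. The function names are hypothetical and the example sequence is made up.

```python
import math

def phred(error_rate):
    """Phred-style quality value for a per-base error rate."""
    return -10 * math.log10(error_rate)

def non_acgt_positions(seq):
    """0-based positions of characters other than A, C, G, T."""
    return [i for i, base in enumerate(seq.upper()) if base not in "ACGT"]

print(round(phred(1 / 50_000), 1))       # quality equivalent of 1 error per 50 kb
print(non_acgt_positions("CGAGCNTCGC"))  # made-up sequence with one ambiguous base
```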
26Genome projects: Archaea and Bacteria only
http://www.genomesonline.org/
27Metagenomic assembly and Finishing
The whole-genome shotgun sequencing approach was used for a number of microbial community projects; however, useful quality control and assembly of these data require reassessing methods developed to handle relatively uniform sequences derived from isolate microbes.
- Typically the size of a metagenomic sequencing project is very large
- Different organisms have different coverage. Non-uniform sequence coverage results in significant under- and over-representation of certain community members
- Low coverage for the majority of organisms in highly complex communities leads to poor (if any) assemblies
- Chimeric contigs are produced by co-assembly of sequencing reads originating from different species
- Genome rearrangements and the presence of mobile genetic elements (phages, transposons) in closely related organisms further complicate assembly
- No assemblers have been developed for metagenomic data sets
28QC: Annotation of poor quality sequence
To avoid this: make sure you use high quality sequence; choose a proper assembler
A Bioinformatician's Guide to Metagenomics. Microbiol Mol Biol Rev. 2008 December; 72(4): 557-578.
29Assembly mistakes
A Bioinformatician's Guide to Metagenomics. Microbiol Mol Biol Rev. 2008 December; 72(4): 557-578.
30Recommendations for metagenomic assembly
- Use a trimmer (Lucy, etc.) to treat reads PRIOR to assembly
- None of the existing assemblers is designed for metagenomic data, but assemblers like PGA work better with paired-read information and produce better assemblies
- We are currently testing the Newbler assembler for second generation sequencing: 454 only and 454/Solexa co-assembly
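What trimming reads prior to assembly means can be shown with a deliberately simple end trimmer. This is not Lucy's algorithm; it only removes bases from both ends of a read while their quality is below a threshold. The function name and the threshold of 20 are assumptions for illustration.

```python
def trim_low_quality_ends(seq, quals, min_q=20):
    """Strip bases with quality < min_q from both ends of a read.

    Returns the trimmed sequence and its matching quality values.
    Low-quality bases in the middle of the read are left in place.
    """
    start, end = 0, len(seq)
    while start < end and quals[start] < min_q:
        start += 1
    while end > start and quals[end - 1] < min_q:
        end -= 1
    return seq[start:end], quals[start:end]

# Made-up read and quality values.
print(trim_low_quality_ends("ACGTAC", [5, 30, 30, 10, 30, 2]))
```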
31Metagenomic finishing approach
Candidatus Accumulibacter phosphatis (CAP)
Binning: which DNA fragment is derived from which phylotype? (BLAST, GC, read depth)
Lucy/PGA
Complete genome of Candidatus Accumulibacter
phosphatis
45
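Two of the binning signals named on this slide, GC content and read depth, are simple per-contig numbers. The sketch below shows one way to compute them; the function names are hypothetical and the inputs are made up. Assigning contigs to a phylotype from these values (together with BLAST hits) is a separate step not shown here.

```python
def gc_content(seq):
    """Fraction of G and C among the unambiguous (A, C, G, T) bases."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt

def mean_read_depth(total_read_bases, contig_length):
    """Average number of reads covering each base of a contig."""
    return total_read_bases / contig_length

# Made-up values, for illustration only.
print(gc_content("ATGCGGCC"))
print(mean_read_depth(120_000, 10_000))
```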
32A few more details: read quality
34Merged assemblies (k31 and k51) with minimus (Cloneview used for visualization)
Illumina only data
35Stats for 31, 51 and merged 31-51 assemblies
36