Genome Assembly - PowerPoint PPT Presentation

Provided by: HankGl

Title: Genome Assembly


1
  • Genome Assembly
  • and Finishing
  • Alla Lapidus, Ph.D.
  • Associate Professor
  • Fox Chase Cancer Center

2
A typical microbial (and not only microbial) project
Sequencing
Draft assembly
Goals: completely restore the genome; produce a
high-quality consensus
FINISHING
Annotation
Public release
3
Sequencing Technology at a Glance
4
Evolution of Microbial Drafts
  • Sanger only
  • 4x of 3 kb plasmids + 4x of 8 kb plasmids + 1x of
    fosmids
  • 50k for a 5 Mb genome draft
  • Hybrid Sanger/pyrosequence/Illumina
  • 4x of 8 kb Sanger + 15x coverage 454 shotgun + 20x
    Illumina (quality improvement)
  • 35k for a 5 Mb genome draft
  • 454 + Solexa
  • 20x coverage 454 standard + 4x coverage 454 paired
    end (PE) + 50x coverage Illumina shotgun (quality
    improvement, gaps)
  • 10k per 5 Mb genome
  • Solexa only: low cost, but too fragmented; a good
    assembler is needed!
  • Solexa + PacBio: low cost, better scaffolding
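
The "x" figures on this slide are depths of coverage: total sequenced bases divided by genome size. A minimal sketch of that arithmetic follows. The read lengths are illustrative assumptions, not values from the slides.

```python
import math


def coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average depth of coverage: total sequenced bases / genome size."""
    return num_reads * read_length / genome_size


def reads_needed(target_coverage: float, read_length: int, genome_size: int) -> int:
    """Smallest number of reads that reaches at least the target coverage."""
    return math.ceil(target_coverage * genome_size / read_length)


# Illustrative only: a 5 Mb genome with hypothetical read lengths.
GENOME_SIZE = 5_000_000
print(reads_needed(4, 700, GENOME_SIZE))    # Sanger-like reads at 4x
print(reads_needed(20, 250, GENOME_SIZE))   # 454-like reads at 20x
```

Shorter reads need proportionally more reads for the same depth, which is one reason short-read-only drafts depend so heavily on the assembler.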
5
Process Overview
6
Library Preparation - Sanger
DNA fragmentation
Random fragment DNA
7
Library Preparation - new
8
Assembly (assembler)
  • Sanger reads only (phrap, PGA, Arachne)
  • 454/Solexa (Newbler, PCAP, Velvet, ALLPATHS, etc.)

9
Draft assembly - what we get
Assembly: a set of contigs
Ordered sets of contigs (scaffolds), linked by paired ends (PE)
Clone walk (Sanger library)
New technologies: no clones to walk off, even if
you can scaffold contigs (bPCR: a new approach to
gap closing)
10
Primer walking
Clone walk (captured gaps)
Clone A
11
Why do we have gaps
What are gaps? Genome areas not covered by the
random shotgun.
  • Sequencing coverage may not span all regions of
    the genome, thus producing gaps in the assembly
    (colony picking)
  • Assembly of the shotgun reads may produce
    misassembled regions due to repetitive sequences
    (new and old tech)
  • A biased base content (this can result in failure
    to be cloned, poor stability in the chosen
    host-vector system, or inability of the
    polymerase to reliably copy the sequence)
  • AT-rich DNA clones poorly in bacteria
    (cloning bias, promoter-like structures; Sanger)
    -> uncaptured gaps
  • GC-rich DNA is difficult to PCR and to
    sequence, and often requires the use of special
    chemistry -> captured gaps
  • Gaps at high AT and high GC content are caused by
    problematic PCR (new tech)
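
The first cause above (random shotgun simply missing some regions) can be estimated with the Lander-Waterman model. The slides do not mention this model; it is brought in here as a standard approximation. At coverage c, the expected uncovered fraction of the genome is about e^(-c), and with N reads the expected number of gaps is roughly N * e^(-c). The sketch ignores cloning bias, repeats and the minimum overlap needed to join reads, so real projects will see more gaps than it predicts.

```python
import math


def uncovered_fraction(c: float) -> float:
    """Expected fraction of the genome hit by no read at coverage c."""
    return math.exp(-c)


def expected_gaps(num_reads: int, c: float) -> float:
    """Rough expected number of gaps for num_reads reads at coverage c."""
    return num_reads * math.exp(-c)


# Illustrative only: 5 Mb genome, hypothetical 700 bp reads.
genome_size, read_length = 5_000_000, 700
for c in (4, 8, 10):
    n = round(c * genome_size / read_length)
    print(c, round(uncovered_fraction(c) * genome_size), round(expected_gaps(n, c), 1))
```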

12
Assembling repeats
13
High GC sequencing problems
  • The presence of small hairpins (inverted repeat
    sequences) in the DNA that reanneal either during
    sequencing or electrophoresis, resulting in failed
    sequencing reactions or unreadable electrophoresis
    results. (This can be mitigated by adding modifiers
    to the reaction, sequencing smaller clones, and
    running gels at higher temperatures in the presence
    of stronger denaturants.)
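
A hairpin candidate is a short stem followed, a few bases later, by its own reverse complement. The sketch below scans a sequence for such exact inverted repeats. The stem and loop sizes are arbitrary assumptions, and real secondary-structure prediction is thermodynamic and tolerates mismatches, so treat this only as an illustration of the idea.

```python
def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA string; unknown bases become N."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp.get(base, "N") for base in reversed(seq.upper()))


def find_hairpins(seq: str, stem: int = 6, min_loop: int = 3, max_loop: int = 10):
    """Return (stem_start, loop_length) for each exact inverted repeat found."""
    seq = seq.upper()
    hits = []
    for i in range(len(seq) - 2 * stem - min_loop + 1):
        left = seq[i:i + stem]
        if "N" in left:
            continue  # ambiguous bases cannot form a reliable stem
        target = reverse_complement(left)
        for loop in range(min_loop, max_loop + 1):
            j = i + stem + loop
            if j + stem > len(seq):
                break
            if seq[j:j + stem] == target:
                hits.append((i, loop))
    return hits


# Toy example: stem GCGCCG, 4-base loop, then the stem's reverse complement.
toy = "TT" + "GCGCCG" + "AAAA" + reverse_complement("GCGCCG") + "TT"
print(find_hairpins(toy))
```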

14
Why more than one platform?
  • 454: high-quality, reliable skeletons of genomes
    (454 standard + 454 PE) with correctly assembled
    contigs. Problems: repeats (unassembled, or
    assembled in contigs outside of the main scaffolds)
    and homopolymer-related frameshifts
  • Illumina: data is used to help improve the overall
    consensus quality, to correct frameshifts and to
    close secondary-structure-related gaps. Not ready
    for de novo assembly of complex genomes (too many
    gaps!)
  • Sanger: finishing reads; fosmids for larger
    repeats and as templates for primer walking. Less
    cost effective, but very useful in many cases

15

454 (pyrosequence) and low GC genomes: Thermotoga
lettingae TMO
Sanger-based draft assembly: 55 total contigs,
41 contigs > 2 kb (38% GC, biased Sanger libraries)
Draft assembly + 454: 2 total contigs, 1 contig
> 2 kb (454: no cloning)
<166 bp>: average length of gaps
16
454 and High GC projects
Xylanimonas cellulosilytica DSM 15894 (3.8 Mb,
72.1% GC)
PGA assembly: 9x of 8 kb
PGA assembly: 9x of 8 kb + 454

Assembly      Total contigs  Major contigs  Scaffolds  Misassemblies  N50
PGA-8kb       210            166            4          165            41,048
PGA-8kb+454   33             23             2          14             288,369
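
The N50 column is the contig length such that contigs of that length or longer contain at least half of all assembled bases. A minimal sketch of the usual calculation, with made-up contig lengths (the per-contig lengths behind the table are not given in the slides):

```python
def n50(lengths):
    """Length L such that contigs of length >= L hold at least half the bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty assembly


# Hypothetical contig lengths, for illustration only.
print(n50([80, 70, 50, 40, 30, 20]))
```

A higher N50 with fewer contigs, as in the second row of the table, indicates a more contiguous assembly, but N50 says nothing about correctness, which is why the misassembly count is reported alongside it.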
17
NextGen high-quality drafts at JGI (multiple
sequencing platforms)
Unassembled 454 reads
Solexa contig
454/Sanger contig
Fosmid ends and 454 PE
1. Pyrosequence and Sanger to obtain the main ordered
and oriented part of the assembly (Newbler
assembler)
2. GapResolution (in-house tool) to close some
(up to 40) gaps using unassembled 454 data
(PGA or Newbler assemblers)
3. Solexa reads to detect and correct errors in the
consensus (in-house tool, the Polisher)
and to close gaps (Velvet)
Fosmid ends are not used for microbes
18
Solving gaps: GapResolution tool
19
Solving gaps: GapResolution tool (II)
Step 3: If the gap is not closed, the tool designs
primers for sequencing reactions
Step 4: Iterate as necessary (in sub-assemblies)
http://www.jgi.doe.gov/ degilbert_at_lbl.gov
20
Solexa for gaps
  • Velvet assembly
  • Blast Velvet contigs against Newbler ends
  • Use proper Velvet contigs to close gaps
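
The three steps above amount to finding a Velvet contig that overlaps the end of one 454 contig and the start of the next, then taking the sequence in between. The toy sketch below uses exact string matching in place of BLAST, ignores reverse-complement orientation, and uses a hypothetical anchor length of 12 bases. All of these are simplifying assumptions, not the JGI procedure.

```python
def close_gap(left: str, right: str, bridge: str, anchor: int = 12):
    """Return the gap-filling sequence from bridge, or None if it does not span.

    left/right are the contigs flanking the gap; bridge is a candidate
    short-read contig. Anchors are the last/first `anchor` bases of the flanks.
    """
    left_anchor = left[-anchor:]
    right_anchor = right[:anchor]
    start = bridge.find(left_anchor)
    if start == -1:
        return None
    fill_start = start + anchor
    end = bridge.find(right_anchor, fill_start)
    if end == -1:
        return None
    return bridge[fill_start:end]


# Toy data: the bridging contig carries 6 bases missing between the flanks.
left = "TTTTACGTACGTAGCTAGCA"
right = "GGCCTTAAGGCCATATAT"
bridge = "AA" + left[-12:] + "CCCGGG" + right[:12] + "TT"
print(close_gap(left, right, bridge))
```

Requiring both anchors in the same bridging contig is what makes a contig "proper" in this sketch: a contig matching only one flank extends it but does not close the gap.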

Velvet contig
454 Contig
Gap
Illumina reads
Velvet contigs close gaps caused by hairpins and
secondary structures
21
Low quality areas: areas of potential
frameshifts
Assemblies contain low quality regions (red tags)
22
Homopolymer-related frameshifts
Frameshift 1 (AAAAA, should be AAAA)
Homopolymers (n > 3)
Frameshift 2 (CCCC, should be CCC)
Modified from N. Ivanova (JGI)
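
Because these frameshifts cluster in homopolymer runs, a first screening step is simply to list every run longer than 3 bases in the consensus. A minimal sketch using a regular expression; the function name and the default threshold of 4 are choices made for this example.

```python
import re


def homopolymer_runs(seq: str, min_len: int = 4):
    """Return (start, base, run_length) for each run of min_len or more."""
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_len - 1))
    return [(m.start(), m.group(1), len(m.group(0)))
            for m in pattern.finditer(seq.upper())]


# Toy consensus with one A run of 5 and one C run of 4.
print(homopolymer_runs("GATAAAAACTCCCCGT"))
```

Each reported position is a candidate for checking against reads from a platform without homopolymer-length errors.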
23
Polisher software for consensus quality
improvement
24
Errors corrected by Solexa
Frame shift detected (454 contig)
CCTCTTTGATGGAAATGATATCTTCGAGCATCGCCTCGGGTT
TTCCATACAGAGAACCTTTGATGATGAACCGGTTGAAGATCTGCGGGTCA
AA CCTCTTTGATGGAAATAATATATTCGAGCATC
TTAGTGGAAATGATATCTTCGAGCATCGCCTC
CGAGCNTCGCCTCGGGCTTTCCCT
CGAGCATCGCCTCGGGTTCTCCATA
CACAGA
GCATCGCCTCGGGTTTTCAATACAGAGAACCT
CAGCGCCTCGGGTTTTCCATACAGAGAAC
CTT
ATCGCCTCGGGTTTTCCAGACAGAGAACCTTT
GGTTCGGGTTTTCCATACAGAGAAC
CTTTGAT
GTTTTCCATACAGAGAACATTTGATGATGAAC

GTTGTCCATACAGAGAACTTTTGATGATGAAC
TATANCATACAGAGAACCT
TTGATGATGAACC
ATTTCCAGACAGAGAACCNTTGATGATGAACC

CAAACAGAGAACCTTTGAGGATGAACCGGTTG

ACAGGGAACCTTAGATGATGAACCGGTTGAAG

ACAGAGAACCTTAGATGATGAACCGGTTGAAG

ACCGTTGATGATGAACCGGTTGAAGATCTGCG

GATGGTGAACGGGTTGAAGATCTGCGGGTCAA

GGTTTGAAGATCTGCGGGTCAAACCAGTCCTC

GGTGGAAGATCTGCGGGTAAAAC
CAGTCCTCT

GGT.GNAGAGCTGCGGGTCAAACCAGTCCTCTG

TGAAGATCTGCGGTTCAAACCAGTCCTCTCCC


GATCGGCGTGTCAAACCAGTCCTCTGCCTCGT

TCTGCGGGTCAAACCAGTACTCTGCCTC
GTTC
25
So, what is Finishing?
  • The process of taking a rough draft assembly
    composed of shotgun sequencing reads, and
    identifying and resolving misassemblies, sequence
    gaps and regions of low quality to produce a highly
    accurate finished DNA sequence.

Final quality
Final error rate should be less than 1 per 50
kb. No gaps, no misassembled areas, no characters
other than ACGT
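
Two of the final-quality criteria can be checked mechanically: the error budget and the ACGT-only rule. The sketch below assumes an error estimate is already available from elsewhere (for example from consensus quality values); `estimated_errors` is a hypothetical input, and misassemblies cannot be detected this way.

```python
def max_allowed_errors(sequence_length: int) -> float:
    """Error budget at 1 error per 50 kb (the finished rate must stay below it)."""
    return sequence_length / 50_000


def meets_finishing_standard(sequence: str, estimated_errors: float) -> bool:
    """True if the sequence is ACGT-only and below 1 error per 50 kb."""
    only_acgt = set(sequence.upper()) <= set("ACGT")
    return only_acgt and estimated_errors < max_allowed_errors(len(sequence))


# A 5 Mb genome must carry fewer than 100 errors in total.
print(max_allowed_errors(5_000_000))
```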
26
Genome projects: Archaea and Bacteria only
http://www.genomesonline.org/
27
Metagenomic assembly and Finishing
The whole-genome shotgun sequencing approach has been
used for a number of microbial community
projects; however, useful quality control and
assembly of these data require reassessing
methods developed to handle relatively uniform
sequences derived from isolate microbes.
  • Typically, the size of a metagenomic sequencing
    project is very large
  • Different organisms have different coverage.
    Non-uniform sequence coverage results in
    significant under- and over-representation of
    certain community members
  • Low coverage for the majority of organisms in
    highly complex communities leads to poor (if any)
    assemblies
  • Chimeric contigs are produced by co-assembly of
    sequencing reads originating from different
    species.
  • Genome rearrangements and the presence of mobile
    genetic elements (phages, transposons) in closely
    related organisms further complicate assembly.
  • No assemblers have been developed for metagenomic
    data sets

28
QC: Annotation of poor quality sequence
To avoid this: make sure you use high quality
sequence; choose a proper assembler
A Bioinformatician's Guide to Metagenomics.
Microbiol Mol Biol Rev. 2008 December; 72(4):
557-578.
29
Assembly mistakes
A Bioinformatician's Guide to Metagenomics.
Microbiol Mol Biol Rev. 2008 December; 72(4):
557-578.
30
Recommendations for metagenomic assembly
  • Use a trimmer (Lucy, etc.) to treat reads PRIOR to
    assembly
  • None of the existing assemblers is designed for
    metagenomic data, but assemblers like PGA work
    better with paired-read information and produce
    better assemblies.
  • We are currently testing the Newbler assembler for
    second-generation sequencing: 454 only and
    454/Solexa co-assembly
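
To make the first recommendation concrete, here is a very simple end-trimming sketch: bases are removed from both ends of a read until a base at or above a quality threshold is reached. This is not Lucy's algorithm, which is considerably more elaborate; the Phred threshold of 20 is an assumption chosen for the example.

```python
def trim_read(seq: str, quals: list, min_q: int = 20):
    """Trim low-quality bases from both ends; returns (sequence, qualities)."""
    start, end = 0, len(seq)
    while start < end and quals[start] < min_q:
        start += 1
    while end > start and quals[end - 1] < min_q:
        end -= 1
    return seq[start:end], quals[start:end]


# Toy read with poor quality at both ends.
print(trim_read("ACGTACGT", [5, 10, 30, 35, 40, 30, 12, 8]))
```

Trimming before assembly matters more for metagenomes than for isolates, because low coverage leaves fewer overlapping reads to outvote a bad base.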

31
Metagenomic finishing approach
Candidatus Accumulibacter phosphatis (CAP)
Binning: which DNA fragment derives from which
phylotype? (BLAST, GC content, read depth)
Lucy/PGA
Complete genome of Candidatus Accumulibacter
phosphatis
45
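
Of the three binning signals named on this slide, GC content is the easiest to illustrate. The sketch below groups contigs into 10% GC intervals. It is a toy: real binning combines GC with BLAST hits and read depth, and the interval width and function names are assumptions made for this example.

```python
def gc_content(seq: str) -> float:
    """Fraction of G+C among unambiguous bases (0.0 if there are none)."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def bin_contigs(contigs: dict, gc_step: float = 0.1) -> dict:
    """Group contig names by GC interval; contigs maps name -> sequence."""
    bins = {}
    for name, seq in contigs.items():
        key = int(gc_content(seq) / gc_step)  # e.g. 2 means 20-30% GC
        bins.setdefault(key, []).append(name)
    return bins


# Toy contigs: two AT-rich, one GC-rich.
print(bin_contigs({"a": "ACTT", "b": "ACTTACTT", "c": "GGCA"}))
```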
32
A few more details: read quality
33
(No Transcript)
34
Merged assemblies (k=31 and k=51) with
Minimus (Cloneview used for visualization)
  • Green: k=31
  • Purple: k=51

Illumina only data
35
Stats for the k=31, k=51 and merged 31-51 assemblies
36
  • Thank you!