Title: slides
1The Phusion Pipeline
Zemin Ning
Production Software Group Informatics Division
2Outline of the Talk
- The Phusion WGS Assembly System
- Before Reads Group (Phusion)
- RPphrap
- RPjoin
- RPono
- Supercontigs Tie Up to the Map
3(No Transcript)
4Insert Sizes
Below is a chart of three hosts used for cloning fragments and the sizes of fragments that the hosts can accommodate.
- Plasmids: 2-10 kb
- Cosmids: 40 kb
- Bacterial Artificial Chromosomes: 70-200 kb
With each of these cloned fragments, reads are taken from both ends. This method is also called Double-Barrelled WGS. After this step, the distance between the two reads on the fragment is known to within 15%. The error rate for paired reads is estimated at about 1.0%.
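The insert-size constraint above can be sketched as a simple consistency check. This is a minimal illustration, assuming the 15 on the slide is a percentage tolerance; the function name and the example distances are hypothetical and not part of Phusion.

```python
def pair_is_consistent(observed_distance, expected_insert, tolerance=0.15):
    """Return True if a read pair's observed separation lies within
    `tolerance` (a fraction, assumed 15% here) of the expected insert size."""
    if expected_insert <= 0:
        raise ValueError("expected_insert must be positive")
    deviation = abs(observed_distance - expected_insert) / expected_insert
    return deviation <= tolerance


# Hypothetical plasmid library with a 4 kb expected insert size.
print(pair_is_consistent(4300, 4000))  # 7.5% away from the expected size
print(pair_is_consistent(5200, 4000))  # 30% away from the expected size
```

A pair that fails such a check would be a candidate for the roughly 1% of mispaired reads mentioned on the slide.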
5Fasta Format
>20SNP44340-2210a02.p1c
AGCCGGTACCCAGTTTTAGTTTTCCT
CTGAAGCAAGCACACCTTCCCTTTCCCGTCTGTCTATCCATCCCTGACCC
TGTTGTCTGTCTATCCCTGACCCCGTAGTCTCCTAAGTCGCCCCAGATT
TTGTGAACACCCTCTGGAACTAGAATCTAGTGGGCGGATGGACCATTTA
CTAGACGGAGGTAGAGGTGGGTGGATGCGACGACAGGGTGCATAGTCAGC
CCGGTTTTAAGGGCAGGTCACTTGGTAGGTCAGCAGGCGGGTCAGTGGG
CGGGTGCCTGCAGCATTTATGAACTTATTTGGCCCAGCAAACATTTTGA
GTGTCAGGCCGTGCCTACCCAAGGTGAGGGTAAGGAGCAAAATCAGCCCA
GCCCAGAGCACTGGGTGGCTACACAGAGCCGACCTCTAATGTGCGCTCC
GGGTCGGGATGGCACTCAGCTCGCCTTTAGGGAGTGATGATCTGGATGCC
TGGCTTGGAGGTGACAGAGCCTGCCCTTATGAGACAATTAAGAGACTGA
CTAAGCACCCGGCAGGAGGCCACGAGAATCCCCATGTGAGAAAGAAGAG
CATAAACAGGAAACACATTTAATAATTAAACAAAGATAACTCCCTCGTGT
GCGCGCACCGGGCCAGCCCCTATAGAAACATCTGAGGAGTCACTTCCTC
CCATGACTCTCGCCCGCCCGGCCGGCTGGAGTCGGCTCCTGGCAAGCTTC
AG
>20SNP44340-2210a02.q1c
NNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNAGTTTTAGTTTTCCTCTGA
AGCAAGCACACCTTCCCTTTCCCGTCTGTCTATCCATCCCTGACCCTGT
TGTCTGTCTATCCCTGACCCCGTAGTCTCCTAAGTCGCCCCAGATTTTG
TGAACACCCTCTGGAACTAGAATCTAGTGGGCGGATGGACCATTTACTAG
ACGGAGGTAGAGGTGGGTGGATGCGAAGCACAGGGTGCATAGTCAGCCC
GGTTTTAAGGGCAGGTCACTTGGTAGGTCAGCAGGCGGGTCAGTGGGCGG
GTGCCTGCAGCATTTATGAACTTATTTGGCCCAGCAAACATTTTGAGTG
TCAGGCCGTGCCTACCCAAGGTGAGGGTAAGGAGCAAAATCAGCCCAGC
CCAGAGCACTGGGTGGCTACACAGAGCCGACCTCTAATGTGCGCTCCGGG
TCGGGATGGCACTCAGCTCGCCTTTAGGGAGTGATGATCTGGATGCCTG
GCTTGGAGGTGACAGAGCCTGCCCTTATGAGACAATTAAGAGACTGACT
AAGCACCCGGCAGGAGGCCACGAGAATCCCCATGTGAGAAAGAAGAGCAT
AAACAGGAAACACATTTAATAATTAAACAAAGATAACTCCCTCGTGTGC
GCGCACCGGGCCAGCCCCTATAGAAACATCTGAGGAGTCACTTCCTCCCA
TGACTCTCGCCCGCCCGGCCGGCTGGAGTCGGCTCCTGGCAAGCTTCAG
GCACCTCAGTTGTCCTGAATACACACAGCACCCTTT
>20SNP44340-2210a03.p1c
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNAGTTTTAGTTTTCCTCTGAAGCAAGCACACCTT
CCCTTTCCCGTCTGTCTATCCATCCCTGACCCTGTTGTCTGTCTATCCCT
GACCCCGTAGTCTCCTAAGTCGCCCCAGATTTTGTGAACACCCTCTGGA
ACTAGAATCTAGTGGGCGGATGGACCATTTACTAGACGGAGGTAGAGGTG
GGTGGATGCGAACGACAGGGTGCATAGTCAGCCCGGTTTTAAGGGCAGG
TCACTTGGTAGGTCAGCAGGCGGGTCAGTGGGCGGGTGCCTGCAGCATT
TATGAACTTATTTGGCCCAGCAAACATTTTGAGTGTCAGGCCGTGCCTAC
CCAAGGTGAGGGTAAGGAGCAAAATCAGCCCAGCCCAGAGCACTGGGTG
GCTACACAGAGCCGACCTCTAATGTGCGCTCCGGGTCGGGATGGCACTC
AGCTCGCCTTTAGGGAGTGATGATCTGGATGCCTGGCTTGGAGGTGACAG
AGCCTGCCCTTATGAGACAATTAAGAGACTGACTAAGCACCCGGCAGGA
GGCCACGAGAATCCCCATGTGAGAAAGAAGAGCATAAACAGGAAACACAT
TTAATAATTAAACAAAGATAACTCCCTCGTGTGCGCGCACCGGGCCAGC
CCCTATAGAAACATCTGAGGAGTCACTTCCTCCCATGACT
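A minimal sketch of reading such a file, assuming the standard FASTA layout in which each record starts with a ">" header line followed by one or more sequence lines. The function name is hypothetical and the example sequences are shortened excerpts of the reads shown above.

```python
def parse_fasta(text):
    """Yield (name, sequence) tuples from FASTA-formatted text."""
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                yield name, "".join(chunks)
            fields = line[1:].split()
            name, chunks = (fields[0] if fields else ""), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


example = """>20SNP44340-2210a02.p1c
AGCCGGTACCCAGTTTTAGTTTTCCT
CTGAAGCAAGCACACC
>20SNP44340-2210a02.q1c
NNNNAGTTTTAGTTTTCCTCTGA
"""
records = list(parse_fasta(example))
for read_name, sequence in records:
    print(read_name, len(sequence))
```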
6
>20SNP44340-2210a02.p1c
6 8 8 8 15 17 17 17
12 12 20 20 29 31 34 34 38 38 40 40 49 49 37 33
33 33 33 30 31 24 24 34 45 45 45 45 38 38 38 45
40 40 40 40 40 40 40 40 40 40 40 37 45 43 45 45
45 40 37 37 40 40 40 45 45 45 45 45 45 45 45 49
45 45 45 45 45 45 45 45 40 45 45 45 42 37 37 40
40 37 45 45 42 37 37 37 45 45 49 49 33 33 33 33
33 33 40 37 40 40 45 45 45 40 40 40 45 45 45 45
49 49 49 49 45 49 45 45 45 45 40 40 43 43 43 40
40 40 37 40 49 49 40 40 37 37 37 42 45 40 40 40
40 37 42 45 45 45 45 45 45 45 45 45 45 49 49 49
49 45 45 45 45 45 38 38 38 45 45 45 33 33 33 33
33 33 37 37 40 45 40 49 49 49 49 49 49 49 36 36
36 29 18 8 6 6 6 10 18 34 34 31 31 34 38 40
40 40 40 38 40 38 45 38 45 45 49 49 49 49 49 49
49 49 49 49 49 49 49 45 40 40 40 40 40 40 37 40
45 45 45 45 40 37 37 37 42 40 40 33 33 33 33 33
40 37 37 37 42 37 45 45 45 45 45 37 43 45 42 36
33 33 33 33 40 40 40 40 42 40 45 45 45 45 40 40
40 40 40 40 35 37 37 45 37 37 40 40 40 40 45 44
38 40 40 40 40 35 33 33 30 30 30 33 33 30 30 30
33 45 45 37 36 36 36 37 35 35 35 35 30 33 30 33
28 28 28 27 27 25 19 19 27 27 33 33 36 36 38 38
33 36 36 34 34 40 40 40 34 31 33 31 34 34 34 34
34 34 38 38 45 45 49 49 49 49 45 45 37 34 34 34
34 34 34 33 33 33 33 33 40 49 49 45 45 40 35 36
36 34 34 37 40 40 40 40 45 37 37 40 40 36 36 35
36 36 36 36 36 33 33 27 27 21 19 19 27 33 33 34
36 36 36 36 38 36 36 40 33 35 37 45 45 45 37 37
38 38 38 45 37 45 37 45 40 40 40 40 37 37 37 37
37 37 45 38 38 38 38 38 38 45 45 45 45 40 40 37
37 43 43 37 36 36 35 35 35 37 37 37 37 37 45 45
40 40 37 37 37 37 43 36 34 34 34 34 37 45 34 34
34 34 34 29 29 22 22 22 30 37 37 37 37 34 34 34
37 33 36 34 34 34 37 37 37 38 37 35 34 32 31 28
28 31 31 34 37 37 37 37 34 35 35 35 37 37 37 37
37 40 40 38 38 36 36 33 33 28 27 24 23 19 24 27
27 33 33 28 28 30 30 30 30 30 36 34 34 30 22 22
21 17 19 29 32 36 36 25 19 19 16 12 10 16 15 12
12 18 16 14 14 25 23 22 22 24 19 12 25 10 10 12
21 21 25 24 24 29 29 29 39 39 26 26 24 21 17 17
17 19 19 24 14 19 9 9 11 18 10 8 9 9 10 9
8 14 12 13 15 10 11 14 12 12 8 8 8 8 12 8
8 8 10 9 4
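A short sketch of how per-base quality values like these are usually interpreted. It assumes they are Phred-scaled (Q = -10 log10 of the error probability), which is the usual convention for Phrap input but is not stated on the slide. The function name is hypothetical; the sample values are the first sixteen from the read above.

```python
def phred_to_error_probability(q):
    """Convert a Phred-scaled quality value to an estimated error probability."""
    return 10 ** (-q / 10)


# First sixteen quality values of the read shown above.
qualities = [6, 8, 8, 8, 15, 17, 17, 17, 12, 12, 20, 20, 29, 31, 34, 34]

# Under the Phred assumption, Q20 means about one error in 100 bases.
high_quality = [q for q in qualities if q >= 20]
print(len(high_quality), "of", len(qualities), "bases are at Q20 or better")
print(phred_to_error_probability(20))
```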
7Fastq Format
@20SNP45079-1505a04.p1c bases 1 to 616
GCAAGCTTGCATGCCTGCAGGTCGACTCTAGAGGATCCCCGCAATT
CTATGATGTCCCAAATAAAATATTGAGTTTAGAGATTTGTCTCTATATT
TTATTTGATGTTGACAACCACATTTGAAAACAGGAAAAAAAAACTGTTT
CTATAAGGGATTGAGCTCATGGGCTAGTTGTCTTCAAAGTAGAACAAAC
TTTTGGTTCTGTTGGAGTCTCTGAATAACAAAATCTCCAATTGAACTTCA
CATTCATTAGAACTAACAATTTGAATACTTTGCCCATCATTAACCATAA
TTCATTTTTTGAGATACAGCAGAATAGCCTCCTGGAGTTCTAACTCCTC
TCTCTTCCCAAGCAAAGGAGCTCCTGCAGAATCCATGGATTCAGTAATA
AGCATGAGAAGAAAGACGTAATTGTGCCGCAGAAGTCAGGAGGCCATANG
AGCANGTTGGGAATGACTGTGGTTCTTGAGCCTTTCGCCTACCACTGCC
CCTTCTTCATACCCTTGAGCAGCCTTTCTTTTTTTAATCAATGGAGATT
TTAAGAAGAGGAGACTGCTTTCCTAGATTTCAGAATACATGTGGAAGCA
TTTTATATTTCCCTGCTCTACTGTCTTAA
+20SNP45079-1505a04.p1c bases 1 to 616
!1)04<EEAC<>>>>>>>>>>>><<45/.
''47>>>>C?DCCCCCHB54)''.4<ACCCCCDDDHHHTTKFDD>HAA
>DBDHFFFFFGKKKKKFGIINNNNOYMDDDDAAACNTQQMMFFDD
<<CKQYYYYYQQDDDOFIIII9B4400<BK@@DD4442<QQQII<42
066>>>ECIYYYYQQ???MMFFFKFFBBKQQQDFFFFIIIHYPP
HHIIFFMQHHBDDFFHDDD94/-003249ADDITTYQQ?DDIIIYQQFF
DDDDDDDODDDDDDNCCAAADYYYYYQDAD<>>>>7<<BBKKLTON
9?<9>FHDDFEOYYYYYYYYOOBB9998840255/39>4/,,6?CKKK
BO633-829AAHDD<>7>>DQQPIIBIAAACIIAF@91911662
4/)))0...660228@8<>>A>>>I???>?AA388AF<
60--45<>>>99333029II>>?9<>>>>533688668>>II@QCCQA
33.0.156-1<<700,-//846225/<))(-0)))
--466842-.0045><863/57725888>>88...)((
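A minimal sketch of decoding a FASTQ quality line into numeric values, assuming the Sanger convention of one ASCII character per base with an offset of 33. The slide does not state the offset, and the function name is hypothetical. The sample string is the start of the quality line above, reading the "lt" residue as "<".

```python
def decode_quality(quality_string, offset=33):
    """Map each quality character to its numeric value (ASCII code minus offset)."""
    return [ord(ch) - offset for ch in quality_string]


print(decode_quality("!1)04<EEAC"))
# prints [0, 16, 8, 15, 19, 27, 36, 36, 32, 34]
```

The encoding stores one character per base, so the quality line is as long as the sequence line and the two can be kept in a single file.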
8Reads Naming
9Procedures Before and After Reads Grouping
- Reads Quality Clipping (ssahaCLIP) - optional
- Remove small sections where quality is low
- Reads Vector Screening (ssaha_screen)
- Remove small sections of vector contamination sequences
- Make a read pair file mates (ssahaMates)
- From the read name list, make a file to mark all the read pairs
- Phusion reads grouping
- Make small files for grouped reads (fmate)
- Put DNA bases and quality values into small files that Phrap can handle
- RPphrap on a Computational Farm
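The read-pairing step can be illustrated with a small sketch. It assumes, from the example read names on the earlier slides, that the two ends of a clone share a template name and differ in a suffix starting with "p" or "q". This is an illustration only and not the behaviour of ssahaMates; the function name is hypothetical.

```python
from collections import defaultdict


def find_mates(read_names):
    """Pair reads that share a template name and have 'p' and 'q' suffixes."""
    by_template = defaultdict(dict)
    for name in read_names:
        template, _, suffix = name.rpartition(".")
        if not template or not suffix:
            continue  # name does not follow the assumed template.suffix layout
        if suffix[0] in ("p", "q"):
            # Keep the first read seen for each end of the template.
            by_template[template].setdefault(suffix[0], name)
    return [
        (ends["p"], ends["q"])
        for _, ends in sorted(by_template.items())
        if "p" in ends and "q" in ends
    ]


names = [
    "20SNP44340-2210a02.p1c",
    "20SNP44340-2210a02.q1c",
    "20SNP44340-2210a03.p1c",  # no partner in this list
]
print(find_mates(names))
```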
10Unique and Repetitive DNA Sections
[Diagram labels: A X B X C]
11(No Transcript)
12RPphrap Using Read Pairs to Close Gaps
13RPjoin Join those contigs with shared reads
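One way to picture joining contigs with shared reads is to group contigs that are linked, directly or through other contigs, by at least one common read. The sketch below does this with a union-find structure. It is a hedged illustration, not the RPjoin algorithm, and all names and data are hypothetical.

```python
def group_contigs_by_shared_reads(contig_reads):
    """Group contig names that are connected through shared read names.

    contig_reads maps each contig name to the set of reads it contains.
    """
    parent = {contig: contig for contig in contig_reads}

    def find(contig):
        # Follow parent links to the group representative, halving the path.
        while parent[contig] != contig:
            parent[contig] = parent[parent[contig]]
            contig = parent[contig]
        return contig

    first_owner = {}
    for contig, reads in contig_reads.items():
        for read in reads:
            if read in first_owner:
                a, b = find(first_owner[read]), find(contig)
                if a != b:
                    parent[b] = a  # merge the two groups
            else:
                first_owner[read] = contig

    groups = {}
    for contig in contig_reads:
        groups.setdefault(find(contig), []).append(contig)
    return sorted(sorted(group) for group in groups.values())


example = {
    "c1": {"r1", "r2"},
    "c2": {"r2", "r3"},
    "c3": {"r4"},
    "c4": {"r3", "r5"},
}
print(group_contigs_by_shared_reads(example))
```

A real join would also have to align and merge the contig sequences; the sketch only finds which contigs are candidates for joining.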
14Building Supercontigs
Also called scaffolds.
15RPono - Build up of Supercontigs
16The rock-phase placed unitigs that were
consistently positioned by at least two mate
pairs, the stone-phase placed unitigs that were
positioned by a single mate pair and confirmable
by an overlap tiling across the gap containing
it, and the pebble-phase attempted to find the
best tiling across gaps using a quality-value
based measure of significance.
17Map Tying - Placing the assembly on the genome
- A sequence tag is a short sequence that is unique within the whole genome.
- A genetic map contains many sequence tags and their locations.
- Align the supercontigs to the genome according to the tags.
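The tag-based placement above can be sketched as follows, assuming a single linear map and that tag hits in each supercontig are already known. Supercontigs are ordered by the median map position of their tags. The function name, tag names and positions are hypothetical.

```python
from statistics import median


def order_supercontigs(tag_positions, supercontig_tags):
    """Order supercontigs by the median map position of the tags they contain.

    Supercontigs with no mapped tag are left unplaced.
    """
    placed = []
    for name, tags in supercontig_tags.items():
        positions = [tag_positions[t] for t in tags if t in tag_positions]
        if positions:
            placed.append((median(positions), name))
    return [name for _, name in sorted(placed)]


tag_positions = {"tagA": 100, "tagB": 250, "tagC": 900, "tagD": 1200}
supercontig_tags = {
    "sc2": ["tagC", "tagD"],
    "sc1": ["tagA", "tagB"],
    "sc3": ["tagX"],  # tag not on the map, so sc3 stays unplaced
}
print(order_supercontigs(tag_positions, supercontig_tags))
```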
18- Zebrafish as a model organism
- Danio rerio
- Fish length: 3 cm
- Estimated genome size: 1.55 Gb
- Easy to maintain
- short generation time
- can be kept at high densities
- Easy to manipulate
- external fertilisation and development
- transparent embryos
- Sanger Institute WGS project started in spring 2001
- DNA source: Tuebingen embryos
- WGS read insert sizes: 2-10 kb
- BAC-end insert sizes: 165-175 kb
- Polymorphism: 1000 5-day-old embryos
- SNP density: one in every 200 bp
- Indel density: one in every 1500 bp
- Indel length: 2-30 bp
19Gap-Hash Pattern used for zfish
kmer: 18; Fill: 6; Gap: 5; kmer covers 26 bp
More complicated patterns could be tested later, like those used in PatternHunter.
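A gapped (spaced) k-mer samples some positions of a window and skips others, so that 18 sampled bases can span 26 bp. The slide does not spell out how the Fill and Gap values lay out the pattern, so the mask below is a hypothetical one chosen only to match those two totals. It is not the pattern used for zebrafish.

```python
# Hypothetical mask: "1" = base is sampled, "0" = base is skipped.
# Three filled blocks of 6 separated by two gaps of 4: 18 sampled, 26 spanned.
MASK = "111111" + "0000" + "111111" + "0000" + "111111"


def gapped_kmers(sequence, mask=MASK):
    """Yield (start, kmer) for every window, keeping only the masked-in bases."""
    keep = [i for i, bit in enumerate(mask) if bit == "1"]
    span = len(mask)
    for start in range(len(sequence) - span + 1):
        window = sequence[start:start + span]
        yield start, "".join(window[i] for i in keep)


reference = "A" * 26
variant = "A" * 7 + "C" + "A" * 18  # differs only at position 7, inside a gap
ref_kmer = next(gapped_kmers(reference))[1]
var_kmer = next(gapped_kmers(variant))[1]
print(ref_kmer == var_kmer)
```

Because the mismatch falls in a skipped position, both windows hash to the same k-mer, which is the property that makes gapped patterns more tolerant of SNPs than contiguous k-mers of the same weight.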
20Zebrafish WGS Assembly
WGS reads
- Number of shotgun reads: 11.7 million
- Estimated genome size: 1.55 Gbp
- Total number of bases: 7.64 Gbp
- Estimated read coverage: 4.5X
- Number of reads placed: 9,953,938
- Ratio of placed reads: 84.8%
Assembly features - contig stats
- Total number of contigs: 430,985
- Total bases of contigs: 1.31 Gbp
- N50 contig size: 4,451
- Averaged contig size: 3,030
- Contig coverage over the genome: 79.8%
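For reference, the N50 statistic quoted above is the length L such that contigs of length L or longer contain at least half of all assembled bases. A minimal sketch with a hypothetical function name and toy lengths:

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    # Walk from the longest contig down until half of the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))  # prints 8
```

In the toy example the total is 54 bases, and the three longest contigs (10 + 9 + 8 = 27) reach half of it, so the N50 is 8.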
21Zebrafish WGS Assembly (2)
Assembly - supercontig stats
- Total number of supercontigs: 104,666
- Total bases of supercontigs: 1.39 Gbp
- N50 supercontig size: 68,456
- Averaged supercontig size: 13,325
- Supercontig coverage over the genome: 93.5%
AGP contigs after supercontigs tied with FPC map
- Total number of AGP contigs: 83,470
- Total bases of AGP contigs: 1.45 Gbp
- N50 AGP contig size: 296,896
- Averaged AGP contig size: 17,398
- AGP contig coverage over the genome: 95.97%
22Cross Genome Comparison
23Conservation of synteny between human and mouse
A typical 510-kb segment of mouse chromosome 12
that shares common ancestry with a 600-kb section
of human chromosome 14 is shown. Blue lines
connect the reciprocal unique matches in the two
genomes. The cyan bars represent sequence
coverage in each of the two genomes for the
regions. In general, the landmarks in the mouse
genome are more closely spaced, reflecting the
14% smaller overall genome size.
The mouse genome. Nature 420, 520 - 562
24Phusion for Cross Genome Comparison
[Diagram labels: Single Copy Genome and Comparison Genome, each indexed 1, 2, 3, ..., 4k; Single Copy Hash Head; Total Hash Head]
25Acknowledgements
- Jim Mullikin
- Tony Cox
- Richard Durbin
- Jane Rogers
- Young GU
- Adam Spargo
- Will Spooner
- Mark Rae
- Steven Leonard
- Sanger Systems Support
- Sanger Sequencing Facilities