slides

About This Presentation

Title:

slides

Description:

Repeats in the Genomes. Kmer Words Hashing and Distribution. Relational Matrix ... Euler (UCSD, USC) P.A. Pevzner, H. Tang, M.S. Waterman, RECOMB (2001) ... – PowerPoint PPT presentation

Number of Views:118

Avg rating:3.0/5.0

Slides: 24

Provided by: deptch

Category:

more less

Transcript and Presenter's Notes

Title: slides

1
WGS Assembly and Reads Clustering
Zemin Ning
Production Software Group Informatics Division
2
Outline of the Talk

Whole Genome Shotgun Sequencing
Insert Sizes
Repeats in the Genomes
Kmer Words Hashing and Distribution
Relational Matrix
Profile of unique kmer words
Phusion Steps
How to run Phusion parameter selections

3
WGS Sequencing The WGS method begins by
fragmenting the genome into many pieces of
various sizes. This fragmentation can be done in
several ways, including physically shaking the
DNA and cutting it with restriction enzymes.
Depending on the size of the resulting fragment,
various hosts are used to clone these regions.

Clone-by-Clone Sequencing
ADV. Easy assembly
DIS. Build library physical map
redundant sequencing
Whole Genome Shotgun (WGS)
ADV. No mapping, no redundant sequencing
DIS. Difficult to assemble and resolve repeats

4
Whole Genome Shotgun Sequencing
genome
plasmids (2 10 Kbp)
forward-reverse paired reads
known dist
cosmids (40 Kbp)
500 bp
500 bp
5
Automatic Sequencing

6
Base Calling - Phred
Idealized traces would consist of evenly spaced,
nonoverlapping peaks. Real traces deviate
from this ideal due to imper- fections of the
sequencing reactions, of gel electro-phoresis,
and of trace processing. The first 50 or so
peaks and peaks over 500 or so are particularly
noisy.

Quality high no ambiguities medium
some ambiguities Poor low confidence
7
Historical Context
1995 H.influenzae sequenced using TIGR by
Craig Venter. H. influenzae is the first free
living organism to be sequnced. It has roughly 2
million base pairs. The sequencing used a shotgun
method that assembled 25,000 fragments of 500 bp
each. 1997 Whole Genome Shotgun paper written
by Weber Meyers. This is the first time that a
shotgun method has been suggested for sequencing
the human genome. By this time, the public Human
Genome Project has already started using a
clone-by-clone method. 1997 Phil Green writes
review against WGS. 1998 Celera founded.
Celera entered into a competition with the public
Human Genome Project to sequence the human genome
first. Celeras main advantage was using the
Whole Genome Shotgun method, which had a chance
of failing, but if successful would produce
faster results.
8
1999 Fly genome (180Mbs) sequenced by Celera
using the Celera assembler. The genome is
available by subscription to Celeras
database 2001 Human Genome published. The
genome was sequenced using data from the public
Human Genome Project and Celera. The public
effort used the clone-by-clone method, while
Celera used the Whole Genome Shotgun method.
Celera gives access to the genome
through subscription to the database. The results
from the public project are free to
access. 2001 Mouse Genome sequenced by Celera
using the Whole Genome Shotgun method. It is made
available by Celera on a subscription
basis. 2002 The Mouse Genome published.
Whiteheads ARACHNE and Sangers Phusion were
involved.
9
Whole Genome Assemblers

TIGR Assembler G.G. Sutton et al., Genome Sci
Technol 1, 9-19 (1995) PHRAP P. Green
(1996) Celera Assembler CAP3 X. Huang, A. Madan,
Genome Res 9, 868-877 (1999) RePS J. Wang et al.
Genome Res 12, 824-831 (2002) Phusion
(Sanger) J.C. Mullikin, Z. Ning, Genome Res 13,
81-90 (2003) Arachne (Whitehead/MIT) Euler (UCSD,
USC) P.A. Pevzner, H. Tang, M.S. Waterman, RECOMB
(2001) most assemblers follow the same
approach overlap layout - consensus
10
Unique and Repetitive DNA Sections
A X B X
C
11
(No Transcript)
12
Kmer Word Hashing
Contiguous Base Hash K 12
Gap-Hash 4x3
13
Word use distribution for the mouse sequence data
at 7.5 fold
14
64 -2k
2k
15
Relation Matrix R(i,j) number of kmer words
shared between read i and read j
1 2 3 4 5 6
j N
227 0 0 0 0
1
2
227 187 0 0 0
3
0 187 0 170 0
4
0 0 0 0 213
Group 2 (4,6)
5
0 0 170 0 0
6
0 0 0 213 0
i
R(i,j)
Group 1 (1,2,3,5)
N
16
Relation Matrix R(i,j) Implementation
1 2 3 4 5 6
j 500
1
2
3
4
Number of shared kmer words (lt 63)
5
. . .
Read index
R(i,j)
N
17
Phusion Iterations Cutting The Weakest Link
18
(No Transcript)
19
(No Transcript)
20
Phusion Steps

Hashing the kmer words
Calculate kmer words distribution
Get the list of kmer words only use those occur
2-D times
Combine the kmer words with read index
Sort the combined list
Build up relational matrix
Group the reads
Output.

21
Phusion command line for Zfish ./phusion
kmer 18 depth 13 -fill
6 -gap 5 -match 6 -match2 6 -matrix
500 -break 1 -set 12000
mates fasta/fastq files
22
Phusion2 ?
23
Acknowledgements