DNA Sequencing and Assembly - PowerPoint PPT Presentation

Provided by: Sera4
1
DNA Sequencing and Assembly
2
DNA sequencing
  • How we obtain the sequence of nucleotides of a
    species

ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT CTAGCTAGAC
TACGTTTTA TATATATATACGTCGTCGT ACTGATGACTAGATTACAG
ACTGATTTAGATACCTGAC TGATTTTAAAAAAATATT
3
Which representative of the species?
  • Which human?
  • Answer one
  • Answer two: it doesn't matter
  • Polymorphism rate: number of letter changes
    between two different members of a species
  • Humans: 1/1,000 to 1/10,000
  • Other organisms have much higher polymorphism
    rates

4
DNA sequencing: vectors
[Figure: DNA is broken up ("Shake") into DNA fragments, which are placed at a known location (restriction site) in a vector: a circular genome (bacterium, plasmid)]


5
Different types of vectors
6
DNA sequencing: gel electrophoresis
  • Start at primer (restriction site)
  • Grow DNA chain
  • Include dideoxynucleoside (modified a, c, g, t)
  • Stops reaction at all possible points
  • Separate products by length, using
    gel electrophoresis

7
Electrophoresis diagrams
8
Output of gel electrophoresis: a read
  • A read: 500-700 nucleotides
  • A C G A A T C A G ... A
  • 16 18 21 23 25 15 28 30 32 21
  • Quality scores: $-10 \times \log_{10} \text{Prob(Error)}$
  • Reads can be obtained from leftmost, rightmost
    ends of the insert
  • Double-barreled sequencing:
  • Both leftmost and rightmost ends are sequenced
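
The quality-score formula on this slide can be illustrated with a short Python sketch. The function names are hypothetical, and the scores are the ones shown on the slide; summing the per-base error probabilities to estimate the expected number of wrong bases is an added illustration, not something the slides state.

```python
import math

def phred_from_error_prob(p_error: float) -> float:
    """Quality score Q = -10 * log10(Prob(Error))."""
    return -10.0 * math.log10(p_error)

def error_prob_from_phred(q: float) -> float:
    """Inverse of the formula above: Prob(Error) = 10^(-Q/10)."""
    return 10.0 ** (-q / 10.0)

# Quality scores copied from the slide's example read.
scores = [16, 18, 21, 23, 25, 15, 28, 30, 32, 21]
error_probs = [error_prob_from_phred(q) for q in scores]

# Assumption for illustration: errors at different bases are independent,
# so the expected number of wrong bases is the sum of the probabilities.
expected_errors = sum(error_probs)
print(f"expected wrong bases among {len(scores)}: {expected_errors:.3f}")
```

A score of 20 corresponds to a 1-in-100 chance that the base is wrong, and a score of 30 to 1 in 1,000.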

9
Method to sequence segments longer than 500 bp
genomic segment
cut many times at random (Shotgun)
Get one or two reads from each segment
500 bp
500 bp
10
Reconstructing the Sequence (Fragment Assembly)
reads
Cover region with 7-fold redundancy (7X)
Overlap reads and extend to reconstruct the
original genomic region
11
Definition of Coverage
  • Length of genomic segment: L
  • Number of reads: n
  • Length of each read: l
  • Definition: coverage $C = n l / L$
  • How much coverage is enough?
  • (Lander-Waterman model)
  • Assuming uniform distribution of reads, $C = 10$
    results in 1 gapped region per 1,000,000 nucleotides
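
The coverage figure can be sanity-checked with a simplified Lander-Waterman calculation. The sketch below assumes reads start uniformly at random and that any overlap, however short, is detectable; under those assumptions the expected number of gaps is roughly n·e^(-C). The read length of 500 is taken from the slides, and the function names are illustrative.

```python
import math

def coverage(n_reads: int, read_len: int, genome_len: int) -> float:
    """Coverage C = n * l / L, as defined on the slide."""
    return n_reads * read_len / genome_len

def expected_gaps(n_reads: int, read_len: int, genome_len: int) -> float:
    """Simplified Lander-Waterman estimate: about n * exp(-C) gaps.

    Assumes uniform read placement and no minimum overlap length.
    """
    c = coverage(n_reads, read_len, genome_len)
    return n_reads * math.exp(-c)

genome_len = 1_000_000            # L
read_len = 500                    # l, read length used on the slides
n_reads = 10 * genome_len // read_len   # chosen so that C = 10

print(coverage(n_reads, read_len, genome_len))
print(expected_gaps(n_reads, read_len, genome_len))
```

With these numbers the estimate comes out just under one gap per million nucleotides, in line with the slide's claim for C = 10.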

12
Challenges with Fragment Assembly
  • Sequencing errors
  • 1-2% of bases are wrong
  • Repeats
  • Computation: $O(N^2)$, where N = number of reads

false overlap due to repeat
13
Repeats
  • Bacterial genomes: 5%
  • Mammals: 50%
  • Repeat types:
  • Low-complexity DNA (e.g. ATATATATACATA)
  • Microsatellite repeats $(a_1 \ldots a_k)^N$, where k is 3-6
    (e.g. CAGCAGTAGCAGCACCAG)
  • Common repeat families:
  • SINE (Short Interspersed Nuclear Elements),
    e.g. ALU: 300-long, $10^6$ copies
  • LINE (Long Interspersed Nuclear Elements):
    500-5,000-long, 200,000 copies
  • MIR
  • LTR/Retroviral
  • Other:
  • Genes that are duplicated then diverge
    (paralogs)
  • Recent duplications, 100,000-long, very
    similar copies

14
What can we do about repeats?
  • Two main approaches
  • Cluster the reads
  • Link the reads

16
Strategies for sequencing a whole genome
  1. Hierarchical: clone-by-clone
    • Break genome into many long pieces
    • Map each long piece onto the genome
    • Sequence each piece with shotgun
    • Example: Yeast, Worm, Human, Rat
  2. Online version of (1): walking
    • Break genome into many long pieces
    • Start sequencing each piece with shotgun
    • Construct map as you go
    • Example: Rice genome
  3. Whole genome shotgun
    • One large shotgun pass on the whole genome
    • Example: Drosophila, Human (Celera), Neurospora,
      Mouse, Rat, Fugu

17
Hierarchical Sequencing
18
Hierarchical Sequencing Strategy
genome
  • Obtain a large collection of BAC clones
  • Map them onto the genome (Physical Mapping)
  • Select a minimum tiling path
  • Sequence each clone in the path with shotgun
  • Assemble
  • Put everything together
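
The "select a minimum tiling path" step can be sketched as a greedy interval cover. This is a minimal illustration, assuming each mapped clone is reduced to a (start, end) interval on the physical map; real tiling-path selection also has to deal with mapping uncertainty and required overlap margins, which are ignored here. The function name and the example coordinates are made up.

```python
def minimum_tiling_path(clones, region_start, region_end):
    """Greedy cover of [region_start, region_end] with the fewest clones.

    clones: list of (start, end) intervals (hypothetical map coordinates).
    Returns the chosen clones in order, or None if a gap cannot be covered.
    """
    clones = sorted(clones)
    path, covered, i = [], region_start, 0
    while covered < region_end:
        best = None
        # Among clones starting at or before the covered position,
        # keep the one that reaches furthest to the right.
        while i < len(clones) and clones[i][0] <= covered:
            if best is None or clones[i][1] > best[1]:
                best = clones[i]
            i += 1
        if best is None or best[1] <= covered:
            return None  # no clone extends the covered region
        path.append(best)
        covered = best[1]
    return path

example_clones = [(0, 150), (100, 250), (120, 300),
                  (200, 400), (280, 450), (390, 500)]
print(minimum_tiling_path(example_clones, 0, 500))
```

Only the clones on the returned path would then be shotgun-sequenced, which is the point of mapping first.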

19
Methods of physical mapping
  • Goal
  • Make a map of the locations of each clone
    relative to one another
  • Use the map to select a minimal set of clones to
    sequence
  • Methods
  • Hybridization
  • Digestion

20
1. Hybridization
p1
pn
  • Short words, the probes, attach to complementary
    words
  • Construct many probes
  • Treat each BAC with all probes
  • Record which ones attach to it
  • Same words attaching to BACs X, Y ⇒ overlap

21
Hybridization Computational Challenge
  • Matrix:
  • m probes × n clones
  • (i, j) = 1 if pi hybridizes to Cj,
    0 otherwise
  • Definition: consecutive-ones matrix
  • A matrix whose 1s are consecutive
  • Computational problem:
  • Reorder the probes so that the matrix is in
    consecutive-ones form
  • Can be solved in $O(m^3)$ time (m >> n)
  • Unfortunately, data is not perfect

[Figure: hybridization matrix of probes p1, p2, ..., pm against clones C1, C2, ..., Cn, shown before reordering (rows such as 0 0 1 ... 1 and 1 0 1 ... 0) and after reordering the probes as pi1, pi2, ..., pim and the clones as Cj1, Cj2, ..., Cjn (rows such as 1 1 1 0 0 0 ... 0 and 0 1 1 1 1 1 ... 0)]
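
The consecutive-ones problem can be illustrated on a toy matrix. The sketch below assumes rows are clones and columns are probes, and it simply tries every probe ordering; this brute-force search is only workable for a handful of probes and is not the polynomial-time method the slide refers to. Function names and the example data are made up.

```python
from itertools import permutations

def has_consecutive_ones(matrix):
    """True if, in every row, the 1s form a single contiguous block."""
    for row in matrix:
        ones = [i for i, value in enumerate(row) if value == 1]
        if ones and ones[-1] - ones[0] + 1 != len(ones):
            return False
    return True

def find_probe_order(matrix):
    """Brute force: return a column (probe) order giving consecutive ones.

    Returns None if no ordering works. Exponential in the number of probes.
    """
    n_cols = len(matrix[0])
    for order in permutations(range(n_cols)):
        reordered = [[row[j] for j in order] for row in matrix]
        if has_consecutive_ones(reordered):
            return order
    return None

# Toy data: 3 clones (rows) x 4 probes (columns).
clones_by_probes = [
    [1, 0, 1, 0],
    [0, 1, 1, 0],
    [0, 1, 0, 1],
]
order = find_probe_order(clones_by_probes)
print(order)
print([[row[j] for j in order] for row in clones_by_probes])
```

With noisy hybridization data an exact ordering may not exist, which is the difficulty the last bullet points at.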
22
2. Digestion
  • Restriction enzymes cut DNA where specific words
    appear
  • Cut each clone separately with an enzyme
  • Run fragments on a gel and measure length
  • Clones Ca, Cb both have fragments of lengths li, lj,
    lk ⇒ overlap
  • Double digestion:
  • Cut with enzyme A, enzyme B, then enzymes A + B

23
Whole-Genome Shotgun Sequencing
24
Whole Genome Shotgun Sequencing
genome
plasmids (2-10 Kbp)
forward-reverse linked reads
known dist
cosmids (40 Kbp)
500 bp
500 bp
25
The Overlap-Layout-Consensus approach
1. Find overlapping reads
2. Merge good pairs of reads into longer contigs
3. Link contigs to form supercontigs
..ACGATTACAATAGGTT..
4. Derive consensus sequence
many heuristics
26
1. Find Overlapping Reads
  • Sort all k-mers in reads (k ~ 24)
  • Find pairs of reads sharing a k-mer
  • Extend to full alignment; throw away if not >95%
    similar

TAGATTACACAGATTAC

TAGATTACACAGATTAC
27
1. Find Overlapping Reads
  • One caveat: repeats
  • A k-mer that appears N times initiates $N^2$
    comparisons
  • ALU: 1,000,000 times
  • Solution:
  • Discard all k-mers that appear more than c ×
    Coverage times (c ~ 10)
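
Finding candidate overlaps through shared k-mers, with the repeat filter, can be sketched as follows. This is a simplified illustration: it uses a hash map instead of sorting the k-mers, ignores reverse complements, and stops before the full-alignment step. The function name, the `max_occurrences` parameter (standing in for c × Coverage), and the example reads are assumptions.

```python
from collections import defaultdict
from itertools import combinations

def candidate_overlaps(reads, k=24, max_occurrences=None):
    """Return pairs of read indices that share at least one k-mer.

    k-mers occurring more than max_occurrences times are treated as
    repeat-derived and skipped (the slide's c * Coverage cutoff).
    """
    occurrences = defaultdict(list)  # k-mer -> [(read index, position), ...]
    for read_id, read in enumerate(reads):
        for pos in range(len(read) - k + 1):
            occurrences[read[pos:pos + k]].append((read_id, pos))

    pairs = set()
    for hits in occurrences.values():
        if max_occurrences is not None and len(hits) > max_occurrences:
            continue  # over-represented k-mer: likely a repeat
        read_ids = sorted({read_id for read_id, _ in hits})
        pairs.update(combinations(read_ids, 2))
    return pairs

example_reads = [
    "TAGATTACACAGATTAC",
    "ACACAGATTACTGA",
    "GGGGCCCCGGGGCCCC",
]
print(candidate_overlaps(example_reads, k=8))
```

Each surviving pair would then be extended to a full alignment and kept only if it passes the similarity threshold.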

28
1. Find Overlapping Reads
  • Create local multiple alignments from the
    overlapping reads

TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAG TTACACAGATTATTGA
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAG TTACACAGATTATTGA
TAGATTACACAGATTACTGA
29
1. Find Overlapping Reads (contd)
  • Correct errors using multiple alignment

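[Figure: two alignment columns shown with per-read bases and quality scores; one column has values such as C 20, C 35, T 30, C 0, C 35, C 40, the other has values such as A 15, A 25, -, A 0, A 40, A 25]
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAG TTACACAGATTATTGA
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA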
  • Score alignments
  • Accept alignments with good scores

30
Basic principle of assembly
  • Repeats confuse us
  • Ability to merge two reads is tied to the ability to
    detect repeats
  • We can dismiss as repeat any overlap of < t%
    similarity
  • Role of error correction:
  • Discards 90% of single-letter sequencing errors
  • ⇒ Threshold t increases

31
2. Merge Reads into Contigs (contd)
  • Merge reads up to potential repeat boundaries
  • (Myers, 1995)

32
2. Merge Reads into Contigs (contd)
  • Ignore non-maximal reads
  • Merge only maximal reads into contigs

33
2. Merge Reads into Contigs (contd)
sequencing error
b
a
  • Ignore hanging reads, when detecting repeat
    boundaries

34
2. Merge Reads into Contigs (contd)
?????
Unambiguous
  • Insert non-maximal reads whenever unambiguous

35
3. Link Contigs into Supercontigs
Normal density
Too dense: overcollapsed? (Myers et al. 2000)
Inconsistent links: overcollapsed?
36
3. Link Contigs into Supercontigs (contd)
Find all links between unique contigs
Connect contigs incrementally, if ≥ 2 links
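
The two-link rule can be sketched with a union-find structure. This is a minimal illustration, assuming links are given as unordered contig pairs (one per forward-reverse read pair); it only groups contigs and ignores orientation, link distances, and ordering within a supercontig. All names and example data are made up.

```python
from collections import Counter

def build_supercontigs(contigs, links, min_links=2):
    """Group contigs that are joined by at least min_links read-pair links."""
    # Count links per unordered contig pair, ignoring self-links.
    support = Counter(tuple(sorted(pair)) for pair in links if pair[0] != pair[1])

    parent = {c: c for c in contigs}

    def find(c):
        # Union-find lookup with path halving.
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    for (a, b), count in support.items():
        if count >= min_links:
            parent[find(a)] = find(b)

    groups = {}
    for c in contigs:
        groups.setdefault(find(c), []).append(c)
    return sorted(sorted(group) for group in groups.values())

example_links = [("A", "B"), ("B", "A"), ("B", "C"),
                 ("C", "D"), ("D", "C"), ("C", "D")]
print(build_supercontigs(["A", "B", "C", "D"], example_links))
```

Here B and C share only one link, so they stay in separate supercontigs; requiring two links guards against a single chimeric or misplaced read pair.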
37
3. Link Contigs into Supercontigs
Fill gaps in supercontigs with paths of
overcollapsed contigs
38
3. Link Contigs into Supercontigs
Contig A
Contig B
Define G = (V, E): V = contigs, E = { (A, B) such that d(A, B) < C }
Reason to do so: efficiency; full shortest paths cannot be computed
39
3. Link Contigs into Supercontigs
Contig A
Contig B
Define T: contigs linked to either A or B
Fill gap between A and B if there is a path in G
passing only through contigs in T
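
The gap-filling test can be sketched as a breadth-first search in G that may only step onto contigs in T. This is a simplified illustration: G is given as a plain adjacency dictionary, and edge distances are ignored. The function name and the example graph are made up.

```python
from collections import deque

def gap_path(graph, start, goal, allowed):
    """Fewest-hops path from start to goal whose intermediate contigs
    all belong to `allowed` (the set T); None if no such path exists."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        for nxt in graph.get(path[-1], ()):
            if nxt == goal:
                return path + [goal]
            if nxt in allowed and nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

# Hypothetical graph: r1 and r2 are overcollapsed (repeat) contigs
# linked to A or B; x is a contig outside T.
example_graph = {"A": ["r1", "x"], "r1": ["r2"], "r2": ["B"], "x": ["B"]}
print(gap_path(example_graph, "A", "B", allowed={"r1", "r2"}))
```

Restricting the search to T keeps it local to the gap, which is consistent with the efficiency concern raised on the previous slide.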
40
4. Derive Consensus Sequence
TAGATTACACAGATTACTGA TTGATGGCGTAA CTA
TAGATTACACAGATTACTGACTTGATGGCGTAAACTA
TAG TTACACAGATTATTGACTTCATGGCGTAA CTA
TAGATTACACAGATTACTGACTTGATGGCGTAA CTA
TAGATTACACAGATTACTGACTTGATGGGGTAA CTA
TAGATTACACAGATTACTGACTTGATGGCGTAA CTA
  • Derive multiple alignment from pairwise read
    alignments

Derive each consensus base by weighted voting
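
Weighted voting over alignment columns can be sketched as below. The sketch assumes the reads are already padded to equal length, with a space or '-' marking a gap, and that the weights are per-base quality scores (all weights are 1 if none are given). How ties and gap weights are handled here is an arbitrary choice for illustration, not something the slides specify.

```python
from collections import defaultdict

def weighted_consensus(rows, weights=None):
    """Pick, for each column, the symbol with the largest total weight.

    rows: equal-length aligned reads; ' ' or '-' denotes a gap.
    weights: optional per-base weights with the same shape as rows.
    Columns won by a gap contribute nothing to the consensus.
    """
    length = len(rows[0])
    if weights is None:
        weights = [[1] * length for _ in rows]
    consensus = []
    for col in range(length):
        votes = defaultdict(float)
        for row, row_weights in zip(rows, weights):
            votes[row[col]] += row_weights[col]
        best = max(votes, key=votes.get)
        if best not in "- ":
            consensus.append(best)
    return "".join(consensus)

aligned = [
    "TAGATTACACAGATTACTGA",
    "TAGATTACACAGATTACTGA",
    "TAG TTACACAGATTATTGA",
    "TAGATTACACAGATTACTGA",
]
print(weighted_consensus(aligned))
```

Passing quality scores as weights lets one confident base outvote several low-quality ones, which is the reason for weighting rather than simple majority.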
41
Mouse Genome
  • Several heuristics of iteratively:
  • Breaking supercontigs that are suspicious
  • Rejoining supercontigs
  • Size of problem: 32,000,000 reads
  • Time: 15 days, 1 processor
  • Memory: 28 Gb
  • N50 contig size: 16.3 Kb → 24.8 Kb
  • N50 supercontig size: 0.265 Mb → 16.9 Mb
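
For reference, the N50 statistic quoted above can be computed as in this short sketch. It uses the common definition based on total assembled length (the largest length such that contigs at least that long contain half of all assembled bases); the slides do not say which variant was used, and the example lengths are made up.

```python
def n50(lengths):
    """Largest length L such that contigs of length >= L together
    contain at least half of the total assembled bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input

print(n50([2, 3, 4, 5, 6]))
```

A larger N50 means more of the assembly sits in long pieces, which is why it is used to compare assemblies before and after the heuristics.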

42
Mouse Assembly
43
Sequencing in the (near) future