CSE182-L10 - PowerPoint PPT Presentation

Slides: 40
Provided by: vineet50
Learn more at: https://cseweb.ucsd.edu
Transcript and Presenter's Notes



1
CSE182-L10
  • LW statistics/Assembly

2
Whole Genome Shotgun
  • Break up the entire genome into pieces
  • Sequence ends, and assemble using a computer
  • LW statistics and repeats argue against the success
    of such an approach

Alternative: build a roadmap of the genome, with
physical clones mapped for each region. Sequence
each of the clones, and put them together.
3
Questions
  • Algorithmic: How do you put the genome back
    together from the pieces?
  • Statistical: How many pieces do you need to
    sequence, etc.?
  • The answer to the statistical questions had
    already been given in the context of mapping, by
    Lander and Waterman.

4
Lander Waterman Statistics
  • The fragments are falling randomly on the genome
  • Overlapping fragments form islands of contiguous
    sequence.
  • Ideally, we want one island for each chromosome.
    How many fragments should we sequence?

[Figure: fragments of length L placed on a genome of length G]
5
Lander Waterman Statistics
[Figure: fragments of length L placed on a genome of length G]
6
LW statistics questions
  • As the coverage c increases, more and more areas
    of the genome are likely to be covered. Ideally,
    you want to see 1 island.
  • Q1: What is the expected number of islands?
  • Ans: $N e^{-c\sigma}$, where $c = LN/G$ is the coverage
    and $\sigma = 1 - T/L$
  • As N grows, the number increases at first, and then
    gradually decreases.
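The rise and fall of the island count can be checked numerically. The following is a minimal sketch, assuming the Lander-Waterman expression $N e^{-c\sigma}$ with $c = LN/G$ and $\sigma = 1 - T/L$; the function name and the example values are illustrative and not taken from the slides.

```python
import math


def expected_islands(n_fragments, frag_len, genome_len, min_overlap):
    """Expected number of islands, N * exp(-c * sigma).

    c     = L * N / G   (coverage)
    sigma = 1 - T / L   (T is the overlap needed to detect a join)
    """
    c = frag_len * n_fragments / genome_len
    sigma = 1 - min_overlap / frag_len
    return n_fragments * math.exp(-c * sigma)


if __name__ == "__main__":
    # Hypothetical 1 Mb genome, 500 bp fragments, 50 bp minimum overlap.
    for n in (1_000, 2_222, 5_000, 20_000):
        print(n, round(expected_islands(n, 500, 1_000_000, 50), 1))
```

With few fragments almost every fragment is its own island, so the count grows with N; once $c\sigma$ passes 1 the exponential term dominates and the islands begin to merge.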

7
Analysis: Expected Number of Islands
  • Computing the expected number of islands.
  • Let $X_i = 1$ if an island ends at position $i$, $X_i = 0$
    otherwise.
  • Number of islands $= \sum_i X_i$
  • Expected number of islands $= E(\sum_i X_i) = \sum_i E(X_i)$

8
Prob. of an island ending at i
[Figure: a clone of length L ending at position i, with overlap length T]
  • $E(X_i) = \Pr(\text{island ends at position } i)$
  • $= \Pr(\text{clone began at position } i-L+1$
  • AND no clone began in the next $L-T$ positions$)$
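The step from this probability to the island count can be written out. This is a sketch under the usual assumption that each position independently starts a clone with probability $\alpha = N/G$; the symbol $\alpha$ is introduced here for convenience and does not appear in the slides.

```latex
\begin{align*}
\alpha &= N/G \\
E(X_i) &= \alpha\,(1-\alpha)^{L-T}
        \approx \alpha\, e^{-\alpha (L-T)}
        = \frac{N}{G}\, e^{-c\sigma},
  \qquad \text{since } \alpha (L-T) = \frac{LN}{G}\Bigl(1-\frac{T}{L}\Bigr) = c\sigma \\
\sum_{i=1}^{G} E(X_i) &\approx G \cdot \frac{N}{G}\, e^{-c\sigma} = N e^{-c\sigma}
\end{align*}
```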

9
LW statistics
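  • Pr[Island contains exactly j clones]?
  • Consider an island that has already begun. With
    probability $e^{-c\sigma}$, it will never be continued.
    Therefore
  • Pr[Island contains exactly j clones]
    $= (1 - e^{-c\sigma})^{j-1} e^{-c\sigma}$
  • Expected # of j-clone islands
    $= N e^{-2c\sigma} (1 - e^{-c\sigma})^{j-1}$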

10
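Expected # of clones in an island
  • Expected # of clones in an island $= e^{c\sigma}$
    (the mean of a geometric distribution with stopping
    probability $e^{-c\sigma}$)

Q: How? Why do we care?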
Often, at the beginning of a genome project, we
do not know the length of the genome. This
equation helps us determine the length.
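One way to read this remark: the observed mean number of clones per island estimates $e^{c\sigma}$, which gives $c$, and then $G = LN/c$. The sketch below assumes that Lander-Waterman expression holds; the function and parameter names are hypothetical.

```python
import math


def estimate_genome_length(n_clones, n_islands, clone_len, min_overlap):
    """Estimate G from the observed mean number of clones per island.

    Assumes E[clones per island] = exp(c * sigma), so
    c = ln(mean clones per island) / sigma and G = L * N / c.
    Requires n_clones > n_islands (otherwise the log is not positive).
    """
    sigma = 1 - min_overlap / clone_len
    mean_clones = n_clones / n_islands
    coverage = math.log(mean_clones) / sigma
    return clone_len * n_clones / coverage


if __name__ == "__main__":
    # Hypothetical project: 8000 clones of 500 bp fell into 219 islands.
    print(round(estimate_genome_length(8000, 219, 500, 50)))
```

The estimate is only as good as the random-placement model; repeats and cloning bias would distort it.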
11
Expected length of an island
12
Whole Genome Sequencing Assembly
13
Whole Genome Shotgun
  • Break up the entire genome into pieces
  • Sequence ends, and assemble using a computer
  • LW statistics and repeats argue against the success
    of such an approach

14
Assembly Basics
  • Three main components
  • Overlap
  • Layout
  • Consensus

15
Overlap
  • Given a pair of fragments s1 and s2, do they
    belong together?
  • How would you compute such a match?

16
Overlap
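  • $S[i,j]$ = optimum score of an alignment of
    $s_1[1..i]$ against a suffix of $s_2[1..j]$

[Figure: DP table with row index i (over s1) and column index j (over s2)]
  • The best prefix-suffix alignment is given by
  • $\max_i S[i,n]$, where $n$ is the length of $s_2$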
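The prefix-suffix recurrence can be sketched as a small dynamic program. This is an illustration only: the scoring values (match +1, mismatch -1, gap -2) and the function name are assumptions, not parameters given in the lecture.

```python
def best_overlap(s1, s2, match=1, mismatch=-1, gap=-2):
    """Best score of a prefix of s1 aligned against a suffix of s2.

    S[i][j] = best score of aligning s1[:i] against some suffix of s2[:j].
    S[0][j] = 0 for all j (the suffix of s2 may start anywhere for free).
    Returns (score, i), where i is the length of the s1 prefix used.
    """
    n1, n2 = len(s1), len(s2)
    prev = [0] * (n2 + 1)              # row S[0][*]
    best_score, best_i = prev[n2], 0   # empty overlap scores 0
    for i in range(1, n1 + 1):
        cur = [i * gap] + [0] * n2     # S[i][0]: s1[:i] against nothing
        for j in range(1, n2 + 1):
            sub = match if s1[i - 1] == s2[j - 1] else mismatch
            cur[j] = max(prev[j - 1] + sub,   # align s1[i-1] with s2[j-1]
                         prev[j] + gap,       # s1[i-1] against a gap
                         cur[j - 1] + gap)    # s2[j-1] against a gap
        if cur[n2] > best_score:              # max over i of S[i][n]
            best_score, best_i = cur[n2], i
        prev = cur
    return best_score, best_i


if __name__ == "__main__":
    print(best_overlap("CCGGGTTT", "AAACCCGGG"))
```

Only two rows are kept in memory because the answer needs just the last column; recovering the alignment itself would require the full table or a traceback structure.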

17
Overlap Detection
  • Compute the best prefix-suffix alignments between
    each pair of fragments.
  • Keep the high-scoring ones as evidence of true
    overlap.
  • What is the problem?

18
Overlap detection problem
  • Consider the number of fragments. The LW
    statistics say that we need good coverage (c = 8
    to 10) to get most of the base-pairs.
  • G = 3000 Mb, L = 500
  • Coverage c = LN/G = 10
  • N = 10 × 3×10^9 / 500 = 6×10^7
  • Number of comparisons needed ≈ N^2 = 3.6×10^15
  • Not good! (Only a small fraction are true
    overlaps)
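The arithmetic above is easy to reproduce. A small sketch, with a hypothetical function name, that follows the slide in counting all N^2 ordered comparisons:

```python
def overlap_workload(genome_len, read_len, coverage):
    """Return (number of reads, number of all-against-all comparisons).

    N = c * G / L; the comparison count is N^2, as on the slide.
    Counting unordered pairs instead would give roughly half of that.
    """
    n_reads = coverage * genome_len // read_len
    return n_reads, n_reads * n_reads


if __name__ == "__main__":
    n, comparisons = overlap_workload(3_000_000_000, 500, 10)
    print(f"N = {n:.1e}, comparisons = {comparisons:.1e}")
```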

19
k-mer based overlap (Pigeonhole principle again)
  • Consider a 25 bp sequence.
  • Expected number of occurrences in the genome:
  • $3 \times 10^9 \times 4^{-25} \approx 2.7 \times 10^{-6}$
  • A 25-bp sequence is therefore expected to be unique
    in the genome (under a random-sequence model)!
  • Two overlapping sequences should share a 25-mer
  • Two non-overlapping sequences should not!

[Figure: two overlapping reads sharing a 25 bp k-mer]
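The expectation above follows from treating the genome as a uniformly random string over four letters. A minimal sketch under that assumption (real genomes contain repeats, so this is optimistic); the function name is illustrative.

```python
def expected_kmer_occurrences(genome_len, k):
    """Expected number of occurrences of one fixed k-mer in a random
    genome of the given length: G * 4^(-k)."""
    return genome_len * 4.0 ** (-k)


if __name__ == "__main__":
    for k in (15, 16, 25):
        print(k, expected_kmer_occurrences(3e9, k))
```

Under this model the expectation first drops below 1 at k = 16 for a 3 Gb genome, so k = 25 leaves a wide margin.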
20
Sorting k-mers
  • Build a list of k-mers that appear in the
    sequences and their reverse complements
  • Create a record with 4 entries:
    k-mer, sequence number, position in the sequence,
    reverse complementation flag
  • Sort a vector of these according to k-mer
  • How many records per k-mer are expected?
  • If the number of records for a k-mer exceeds a
    threshold, discard them (why?)

[Figure: sorted records with columns K-mer, S.id, Pos.]
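The record-and-sort scheme can be sketched as follows. This is an illustration, not the assembler's actual code: the function names, the tuple layout, and the `max_records` threshold are assumptions. Over-represented k-mers are skipped on the assumption that they mostly come from repeats and would generate many spurious pairs.

```python
from itertools import combinations

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(_COMPLEMENT)[::-1]


def kmer_records(reads, k):
    """Sorted list of (k-mer, sequence number, position, is_reverse_complement)."""
    records = []
    for seq_id, read in enumerate(reads):
        for is_rc, seq in ((False, read), (True, reverse_complement(read))):
            for pos in range(len(seq) - k + 1):
                records.append((seq[pos:pos + k], seq_id, pos, is_rc))
    records.sort()  # tuples sort by k-mer first, so equal k-mers are adjacent
    return records


def candidate_pairs(reads, k, max_records=50):
    """Pairs of read ids that share at least one k-mer that is not over-represented."""
    records = kmer_records(reads, k)
    pairs = set()
    start = 0
    while start < len(records):
        end = start
        while end < len(records) and records[end][0] == records[start][0]:
            end += 1
        group = records[start:end]
        if len(group) <= max_records:  # discard k-mers with too many records
            ids = sorted({rec[1] for rec in group})
            pairs.update(combinations(ids, 2))
        start = end
    return pairs


if __name__ == "__main__":
    reads = ["ACGTACGTTTGA", "GTTTGACCATAG", "CCCCCCCCCCCC"]
    print(candidate_pairs(reads, k=5))
```

Only the candidate pairs need a full prefix-suffix alignment, which replaces the all-against-all comparison with work proportional to the number of shared k-mers.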
21
Alignment module
  • Coalesce k-mer hits into longer, gap-free partial
    alignments.
  • These extended k-mer hits are saved.
  • For each pair of sequences, form a directed
    graph.
  • For each maximal path in the graph, construct an
    alignment.
  • Refine alignment via banded DP

22
Problem 2: Size
  • Islands might simply be too small in length
  • $\sigma = (1 - T/L) = (1 - 50/500) = 0.9$, $c = 8$.
  • # Islands $= N e^{-c\sigma} \approx$ 45K
  • Size of an island ≈ 54K
  • Not enough to make it an acceptable assembly!
  • PLUS, there is the problem of repeats, chimerism,
    etc.

23
Solution 2: Clones can have mate-pairs
  • Recall that we sequence about 1000bp of the end
    of a clone
  • If we sequenced both ends, we get extra
    information, particularly if we know the length
    of the original clone.

24
Mate Pairs
  • Mate-pairs allow you to merge islands (contigs)
    into super-contigs

25
Super-contigs are quite large
  • Make clones of truly predictable length. Ex.: 3
    sets can be used: 2 Kb, 10 Kb and 50 Kb. The
    variance in these lengths should be small.
  • Use the mate-pairs to order and orient the
    contigs, and make super-contigs.

26
Whole genome shotgun
  • Input: shotgun sequence fragments (reads) and
    mate pairs
  • Output: a single sequence created by consensus of
    overlapping reads
  • First generation of assemblers did not include
    mate-pairs (Phrap, CAP, ...)
  • Second generation: CA (Celera Assembler), Arachne, Euler
  • We will discuss Arachne, a freely available
    sequence assembler (2nd generation)

27
Problem 3: Repeats
28
Repeats and Chimerisms
  • 40-50% of the human genome is made up of
    repetitive elements.
  • Repeats can cause great problems in the assembly!
  • Chimerism: a clone is joined from two different
    parts of the genome. This can again give a
    completely wrong assembly.

29
Repeat detection
  • Lander Waterman strikes again!
  • The expected number of clones in a
    repeat-containing island is MUCH larger than in a
    non-repeat-containing island (contig).
  • Thus, every contig can be marked as unique or
    non-unique. In the first step, throw away the
    non-unique islands.

[Figure: reads piling up on a contig that contains a repeat]
30
Detecting Repeat Contigs 1: Read Density
  • Compute the log-odds ratio of two hypotheses
  • H1: The contig is from a unique region of the
    genome.
  • H2: The contig is from a region that is repeated at
    least twice
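The slide does not spell out the statistic, so the following is a sketch under a stated assumption: read starts in a contig follow a Poisson distribution with mean $\lambda = N\ell/G$ for a unique region and $2\lambda$ for a two-copy repeat. Under that model the log-odds simplify to $\lambda - k \ln 2$ for $k$ observed reads. All names are hypothetical.

```python
import math


def repeat_log_odds(reads_in_contig, contig_len, total_reads, genome_len):
    """ln( P(k reads | unique) / P(k reads | two-copy repeat) ).

    Poisson model (an assumption): mean lam for unique, 2 * lam for a
    two-copy repeat, with lam = N * contig_len / G.
    ln ratio = (-lam + k ln lam) - (-2 lam + k ln(2 lam)) = lam - k ln 2.
    Positive values favour H1 (unique); negative values favour H2.
    """
    lam = total_reads * contig_len / genome_len
    return lam - reads_in_contig * math.log(2)


if __name__ == "__main__":
    # Hypothetical numbers: 1000 reads over a 1 Mb genome, 10 kb contig.
    print(repeat_log_odds(10, 10_000, 1_000, 1_000_000))   # expected depth
    print(repeat_log_odds(20, 10_000, 1_000, 1_000_000))   # doubled depth
```

A contig would be called unique only when the score clears a chosen positive threshold; the threshold itself is a tuning choice not given here.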

31
Detecting Chimeric reads
  • Chimeric reads: reads that contain sequence from
    two genomic locations.
  • Good overlap: G(a,b) if a, b overlap with a high
    score
  • Transitive overlap: T(a,c) if G(a,b) and G(b,c)
  • Find a point x across which only transitive
    overlaps occur. x is a point of chimerism
32
Contig assembly
  • Reads are merged into contigs up to repeat
    boundaries.
  • If (a,b) and (a,c) overlap, (b,c) should overlap as
    well. Also,
  • shift(a,c) = shift(a,b) + shift(b,c)
  • Most of the contigs are unique pieces of the
    genome, and end at some Repeat boundary.
  • Some contigs might be entirely within repeats.
    These must be detected
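The shift relation gives a simple consistency test for a triple of reads. A minimal sketch, with a hypothetical function name and a tolerance parameter added to absorb small alignment gaps (the tolerance is an assumption, not part of the slide):

```python
def shifts_consistent(shift_ab, shift_bc, shift_ac, tolerance=0):
    """True if shift(a,c) equals shift(a,b) + shift(b,c) within tolerance.

    A shift is the offset of the second read's start relative to the
    first read's start in their pairwise overlap.
    """
    return abs(shift_ac - (shift_ab + shift_bc)) <= tolerance


if __name__ == "__main__":
    print(shifts_consistent(100, 150, 250))                 # consistent
    print(shifts_consistent(100, 150, 400, tolerance=10))   # inconsistent
```

A triple that fails the check suggests that at least one of the three overlaps is induced by a repeat rather than by true adjacency.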

33
Creating Super Contigs
34
Supercontig assembly
  • Supercontigs are built incrementally
  • Initially, each contig is a supercontig.
  • In each round, a pair of super-contigs is merged
    until no more can be performed.
  • Create a priority queue with a score for every
    pair of mergeable supercontigs.
  • Score has two terms: a reward for multiple
    mate-pair links, and a penalty for distance
    between the links.

35
Supercontig merging
  • Remove the top scoring pair (S1,S2) from the
    priority queue.
  • Merge (S1,S2) to form supercontig T.
  • Remove all pairs in Q containing S1 or S2
  • Find all supercontigs W that share mate-pair
    links with T and insert (T,W) into the priority
    queue.
  • Detect repeated supercontigs and remove them
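The merging loop can be sketched with a heap. This is a simplified illustration, not Arachne's implementation: the `links` layout, the scoring weights, and the `min_links` cutoff are assumptions, stale queue entries are skipped lazily instead of being removed, and the repeat-detection step is omitted.

```python
import heapq
from itertools import count


def merge_supercontigs(contigs, links, min_links=2, reward=1.0, penalty=0.001):
    """Greedily merge supercontigs linked by mate pairs.

    contigs: iterable of contig ids.
    links:   dict {(contig_a, contig_b): [estimated gap per mate-pair link]}.
    Returns a set of frozensets, each one a supercontig (set of contig ids).
    """
    alive = {frozenset([c]) for c in contigs}   # initially one per contig

    def pair_score(s, t):
        gaps = [g for (a, b), gs in links.items() for g in gs
                if (a in s and b in t) or (a in t and b in s)]
        if len(gaps) < min_links:
            return None                          # not mergeable
        mean_gap = sum(abs(g) for g in gaps) / len(gaps)
        return reward * len(gaps) - penalty * mean_gap

    heap, tie = [], count()

    def push(s, t):
        score = pair_score(s, t)
        if score is not None:
            # heapq is a min-heap, so store the negated score.
            heapq.heappush(heap, (-score, next(tie), s, t))

    initial = list(alive)
    for i, s in enumerate(initial):
        for t in initial[i + 1:]:
            push(s, t)

    while heap:
        _, _, s, t = heapq.heappop(heap)         # top-scoring pair
        if s not in alive or t not in alive:
            continue                             # stale entry: already merged
        merged = s | t
        alive -= {s, t}
        for w in alive:                          # re-link the new supercontig
            push(merged, w)
        alive.add(merged)
    return alive


if __name__ == "__main__":
    links = {("A", "B"): [100, 120], ("B", "C"): [200, 210, 190], ("C", "D"): [50]}
    print(merge_supercontigs("ABCD", links))
```

Lazy deletion keeps the code short: instead of searching the queue for pairs that contain S1 or S2, outdated entries are ignored when they reach the top.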

36
Repeat Supercontigs
  • If the distance between two super-contigs is not
    correct, they are marked as Repeated
  • If transitivity is not maintained, then there is
    a Repeat

37
Filling gaps in Supercontigs
38
Consensus Derivation
  • Consensus sequence is created by converting
    pairwise read alignments into multiple-read
    alignments
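Once reads are placed in a multiple alignment, a simple consensus takes the majority character in each column. The sketch below assumes the alignment is already given as equal-length rows, with `-` for a gap inside a read and a space where a read does not cover the column; this input format and the function name are assumptions for illustration.

```python
from collections import Counter


def consensus(rows):
    """Majority-vote consensus of a multiple-read alignment.

    rows: equal-length strings; '-' is a gap inside a read, ' ' means
          the read does not cover that column.
    Columns whose majority is a gap contribute nothing to the output.
    """
    if not rows:
        return ""
    out = []
    for column in zip(*rows):
        votes = Counter(ch for ch in column if ch != " ")
        if not votes:
            continue                      # no read covers this column
        base, _ = votes.most_common(1)[0]
        if base != "-":
            out.append(base)
    return "".join(out)


if __name__ == "__main__":
    print(consensus(["ACGT  ", " CGTA ", "  GAAC"]))
```

Real assemblers typically weight votes by base quality; plain majority voting is the simplest stand-in.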

39
Summary
  • Whole genome shotgun is now routine
  • Human, Mouse, Rat, Dog, Chimpanzee..
  • Many Prokaryotes (One can be sequenced in a day)
  • Plant genomes: Arabidopsis, Rice
  • Model organisms: Worm, Fly, Yeast
  • A lot is not known about genome structure,
    organization and function.
  • Comparative genomics offers low hanging fruit