Gibson Second Edition - PowerPoint PPT Presentation

1
Genome Sequencing and Annotation (Part 1)
2
  • Objective of most genome projects
  • Sequencing DNA, mRNA
  • Identify genes and characterize gene features
  • This chapter
  • How blocks of DNA seqs. are obtained
  • How these blocks are assembled into contigs, then
    genomes
  • Bioinformatics: how to do seq. alignment, such
    as cDNA/EST and genome seqs.
  • Annotation of ORFs
  • Other features of genes: repetitive elements,
    variable distribution of GC content, evolutionarily
    conserved elements
  • Gene annotation by cross-species comparison

3
Figure 2.1 (Part 2) The principle of dideoxy (Sanger)
sequencing
Automated DNA sequencing
In 1977, F. Sanger developed the chain-termination
method (Sanger sequencing). Sanger won his second
Nobel Prize for inventing this process.
4
Automated DNA sequencing
  • Most current sequencing projects use the chain
    termination method
  • Also known as Sanger sequencing, after its
    inventor
  • Based on action of DNA polymerase
  • Adds nucleotides to complementary strand
  • Requires template DNA and primer

5
Chain-termination sequencing
  • Dideoxynucleotides (ddA, ddT, ddC or ddG) stop
    synthesis
  • Chain terminators (DNA polymerase cannot add
    another nucleotide)
  • Included in limiting amounts, so that across the
    population of copies a chain terminates at every
    position where the base appears in the template
  • Use four reactions
  • One for each base: A, C, G, and T

Template
3' ATCGGTGCATAGCTTGT 5'
Sequence reaction products
5' TAGCCACGTATCGAACA 3'
5' TAGCCACGTATCGAA 3'
5' TAGCCACGTATCGA 3'
5' TAGCCACGTA 3'
5' TAGCCA 3'
5' TA 3'
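The ddA reaction on this slide can be reproduced with a short simulation. This is only a sketch under simplifying assumptions: the primer is ignored, synthesis starts at the first template base, and termination is assumed to occur at every possible position. The function names are illustrative and do not come from the slides.

```python
def complement(template_3to5: str) -> str:
    """Return the newly synthesized strand (5'->3') for a template written 3'->5'."""
    pairs = {"A": "T", "T": "A", "C": "G", "G": "C"}
    return "".join(pairs[base] for base in template_3to5)


def termination_products(template_3to5: str, dd_base: str) -> list[str]:
    """All chains that end where the given ddNTP could be incorporated."""
    new_strand = complement(template_3to5)
    return [new_strand[: i + 1] for i, base in enumerate(new_strand) if base == dd_base]


template = "ATCGGTGCATAGCTTGT"  # written 3'->5', as on the slide
products = {base: termination_products(template, base) for base in "ACGT"}

# Sorting every product by length mimics the gel: the shortest chain runs
# furthest, so reading the terminal bases from short to long gives the sequence.
all_products = sorted((p for plist in products.values() for p in plist), key=len)
read = "".join(p[-1] for p in all_products)

print(products["A"])
print(read)
```

The ddA list matches the six products shown on the slide, and the read recovered from the sorted products equals the complement of the template.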
6
Sequence detection
  • To detect products of sequencing reaction
  • Include labeled nucleotides
  • Formerly, radioactive labels (33P or 35S) were
    used
  • Now fluorescent labels
  • Use different fluorescent tag for each nucleotide
  • Can run all four reactions in a single gel lane
    or capillary tube

TAGCCACGTATCGAA
TAGCCACGTATC
TAGCCACG
TAGCCACGT
7
Sequence separation
  • Terminated chains need to be separated
  • Requires one-base-pair resolution
  • See difference between chains of X and X+1 base
    pairs
  • Gel electrophoresis
  • Very thin gel
  • High voltage applied
  • Works with radioactive or fluorescent labels
  • Negative pole at the top


Gel lanes: C A G T, C A G T
8
Sequence reading of radioactively labeled
reactions
  • The final step of sequencing is to read the
    sequence
  • Radioactively labeled reactions
  • Gel dried
  • Placed on X-ray film
  • Film developed; the position of each band becomes
    visible
  • Sequence read from the bottom (the positive pole) up
  • Each of the four lanes giving the position of a
    different base A, T, C or G


9
Sequence reading of fluorescently labeled
reactions
  • Fluorescently labeled reactions are scanned by a
    laser as they pass a particular point
  • Color picked up by detector
  • Output sent directly to computer
  • The readout is given both in terms of bases and
    the intensity of each color, so that ambiguous
    readings are easily identified

10
Summary of chain termination sequencing
A primer is extended by DNA polymerase based on
the sequence present in the template strand. Each
chain is terminated by a ddNTP that is
complementary to the template strand. The four
reactions are separated on a gel that can resolve
one-base differences. The seq. is then read from
the bottom of the gel to the top.
11
High-Throughput Sequencing
  • The new techniques and equipment include:
  • (1) Four-color fluorescent dyes have replaced the
    radioactive label
  • (2) Rather than stopping the electrophoresis at a
    particular time, the products are scanned for
    laser-induced fluorescence just before they run
    off the end of the electrophoresis medium
  • (3) Improvements in the chemistry of template
    purification and the sequencing reaction
  • (4) Slab gel electrophoresis gave way to
    capillary electrophoresis with the introduction
    in 1999 of Applied Biosystems ABI Prism 3700
    automated sequencers, which in turn were updated
    with ABI Prism 3730 DNA analyzers in 2003
    (these deliver high-quality, long reads and save
    time and money)

ABI Prism 3730 DNA analyzers
12
Reading sequence traces
  • Base-calling: the reading of raw sequence traces
  • Now routinely performed using automated software
    that reads bases, aligns similar seqs. and
    supports editing
  • Program: phred, http://www.phrap.org
  • The program assigns probability scores to the
    accuracy of each base call as the trace is read
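The slide does not give the formula, but the quality value commonly associated with phred is Q = -10 log10(P), where P is the estimated probability that the base call is wrong. A minimal sketch of the conversion in both directions (the function names are illustrative):

```python
import math


def phred_quality(p_error: float) -> float:
    """Quality score for a base call with the given error probability."""
    return -10 * math.log10(p_error)


def error_probability(quality: float) -> float:
    """Inverse mapping: error probability implied by a quality score."""
    return 10 ** (-quality / 10)


for p in (0.1, 0.01, 0.001):
    print(p, round(phred_quality(p), 1))
```

Under this definition a score of 20 corresponds to a 1 in 100 chance of a wrong call, and a score of 30 to 1 in 1000.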

13
Figure 2.3 Automated sequence chromatograms
  1. This seq. shows the noisiness of the first 30 bp of
    a run.
  2. The middle two rows show a segment of two seqs.
    that are polymorphic for both SNPs and an indel.
  3. A decline in seq. quality typically occurs after
    about 800 bp.

14
  • Ex. 2.1 Reading a sequence trace

The base is labeled N due to poor seq. quality. Where
two peaks of the same height are observed at the same
location, the site is heterozygous for a C/T
SNP.
15
Figure 2.5 An aligned-reads window in consed
Contig Assembly
16
Assembling DNA seq. fragments
  • NCBI dbEST database: http://www.ncbi.nlm.nih.gov/Database/
  • View the EST statistics
  • FTP EST files

17
Assembling DNA seq. fragments
  • IFOM assembler
  • http://bio.ifom-firc.it/ASSEMBLY/assemble.html
  • Multiple EST seqs. → contig
  • The max. number of seqs. you can enter is 10000
  • Use gi (15744427, 19124086, 8147732, 8147734,
    20393914, 13728017)
  • Lengths (850, 1062, 634, 596, 869, 768) bp
  • These result in a single contig consensus seq., which
    can be used for similarity search against a db
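The idea of merging overlapping fragments into one contig can be illustrated with a toy greedy assembler. This is not the algorithm used by the IFOM assembler (which the slides do not describe); it is a sketch that assumes error-free reads on the same strand and only accepts exact suffix/prefix overlaps. The reads and function names are made up for illustration.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that exactly equals a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of reads with the longest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # no remaining overlaps: several contigs are left
        merged = reads[best_i] + reads[best_j][best_n:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


toy_reads = ["TAGCCACGTA", "ACGTATCGAA", "TCGAACA"]
contigs = greedy_assemble(toy_reads)
print(contigs)
```

Real EST reads contain sequencing errors and may come from either strand, so production assemblers score approximate overlaps and build a consensus rather than concatenating reads.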

18
Assembling DNA seq. fragments: 6 GI fragments
  • >gi|15744427|gb|BI752849.1|BI752849 603022060F1
    NIH_MGC_114 Homo sapiens cDNA clone IMAGE:5192510
    5', mRNA sequence
    CGGGGTGCTGCGAGCGCGGGGCCAGACCAAGGC
    GGGCCCGGAGCGGAACTTCGGTCCCAGCTCGGTCCCCGGCTCAGTCCCGA
    CGTGGAACTCAGCAGCGGAGGCTGGACGCTTGCATGGCGCTTGAGAGATT
    CCATCGTGCCTGGCTCACATAAGCGCTTCCTGGAAGTGAAGTCGTGCTGT
    CCTGAACGCGGGCCAGGCAGCTGCGGCCTGGGGGTTTTGGAGTGATCACG
    AATGAGCAAGGCGTTTGGGCTCCTGAGGCAAATCTGTCAGTCCATCCTGG
    CTGAGTCCTCGCAGTCCCCGGCAGATCTTGAAGAAAAGAAGGAAGAAGAC
    AGCAACATGAAGAGAGAGCAGCCCAGAGAGCGTCCCAGGGCCTGGGACTA
    CCCTCATGGCCTGGTTGGTTTACACAACATTGGACAGACCTGCTGCCTTA
    ACTCCTTGATTCAGGTGTTCGTAATGAATGTGGACTTCACCAGGATATTG
    AAGAGGATCACGGTGCCCAGGGGAGCTGACGAGCAGAGGAGAAGCGTCCC
    TTTCCAGATGCTTCTGCTGCTGGAGAAGATGCAGGACAGCCGGCAGAAAG
    CAGTGCGGCCCCTGGAGCTGGCTACTGCCTGCAGAAGTGCAACGTGCCCT
    TGTTTGTCCAACATGATGCTGCCAACTGTACCTCAAACTCTGGAACCTGA
    TTAAGGACCAGATCACTGATGTGCACTTGGTGGAGAGACTGCAGGCCCTG
    TATATGATCCGGGTGAAGGACTCCTTGATATGCGTTGACTGTGCCATGGG
    AGAGTAGCAGAAAACAGCAGCATGCTCAACCTCCCACTTTCTCTATTGGA
    TGTGGACTCAAAGCCCT
  • >gi|19124086|gb|BM807263.1|BM807263
    AGENCOURT_6574903 NIH_MGC_124 Homo sapiens cDNA
    clone IMAGE:5732238 5', mRNA sequence
    GTCCGGAATTCCC
    GGGATCTCAGCAGCGGAGGCTGGACGCTTGCATGGCGCTTGAGAGATTCC
    ATCGTGCCTGGCTCACATAAGCGCTTCCTGGAAGTGAAGTCGTGCTGTCC
    TGAACGCGGGCCAGGCAGCTGCGGCCTGGGGGTTTTGGAGTGATCACGAA
    TGAGCAAGGCGTTTGGGCTCCTGAGGCAAATCTGTCAGTCCATCCTGGCT
    GAGTCCTCGCAGTCCCCGGCAGATCTTGAAGAAAAGAAGGAAGAAGACAG
    CAACATGAAGAGAGAGCAGCCCAGAGAGCGTCCCAGGGCCTGGGACTACC
    CTCATGGCCTGGTTGGTTTACACAACATTGGACAGACCTGCTGCCTTAAC
    TCCTTGATTCAGGTGTTCGTAATGAATGTGGACTTCACCAGGATATTGAA
    GAGGATCACGGTGCCCAGGGGAGCTGACGAGCAGAGGAGAAGCGTCCCTT
    TCCAGATGCTTCTGCTGCTGGAGAAGATGCAGGACAGCCGGCAGAAAGCA
    GTGCGGCCCCTGGAGCTGGCCTACTGCCTGCAGAAGTGCAACGTGCCCTT
    GTTTGTCCAACATGATGCTGCCCAACTGTACCTCAAACTCTGGAACCTGA
    TTAAGGACCAGATCACTGATGTGCACTTGGTGGAGAGACTGCAGGCCCTG
    TATACGATCCGGGTGAAGGACTCCTTGATTTGCGTTGACTGTGCCATGGA
    GAGTAGCAGAAACAGCAGCATGCTCACCCTCCCACTTTCTCTTTTTGATG
    TGGACTCAAAGCCCCTGGAAGACACTGGAGGACGCCCTGCACTGCTTCTT
    CCAGCCCAGGAGTTATCAAGCAAAAGCAAGTGCTTCTGTGAGAACTGTGG
    GAAGAAGACCCGCGGGGAACAGGGTCCTGAAACCTGACCATTTTGCCCCA
    GACCTTGACCAATCCACCTCATGGCGATTCTCCCTCCAGGAATTCCCCGA
    CCGAGAAAAAATTGGCCACTTCCCCGGAATTTCCCCCCAAAAACTTGGAA
    TTTCACCCAAAACCTTTCCCATGTAAACCCGGAAACCCTGGGGAAGGCT
  • >gi|8147732|gb|AW958049.1|AW958049 EST370119 MAGE
    resequences, MAGE Homo sapiens cDNA, mRNA
    sequence
    GAACTAGTGGATCCCCCGGGCTGCAGGAATTCGGCACGAGTG
    GAGCTGGCCTACTGCCTGCAGAAGTGCAACGTGCCCTTGTTTGTCCAACA
    TGATGCTGCCCAACTGTACCTCAAACTCTGGAACCTGATTAAGGACCAGA
    TCACTGATGTGCACTTGGTGGAGAGACTGCAGGCCCTGTATATGATCCGG
    GTGAAGGACTCCTTGATTTGCGTTGACTGTGCCATGGAGAGTAGCAGAAA
    CAGCAGCATGCTCACCCTCCCACTTTCTCTTTTTGATGTGGACTCAAAGC
    CCCTGAAGACACTGGAGGACGCCCTGCACTGCTTCTTCCAGCCCAGGGAG
    TTATCAAGCAAAAGCAAGTGCTTCTGTGAGAACTGTGGGAAGAAGACCCG
    TGGGAAACAGGTCTTGAAGCTGACCCATTTGCCCCAGACCCTGACAATCC
    ACCTCATGCGATTCTTCATCAGGAATTCACAGACGAGAAAGATCTGCCAC
    TCCCTGTACTTCCCCCAGAGCTTGGATTTCAGCCAGAACCTTCCAATGAA
    GCGAGAATCTTGTGAAGCTGAAGAACAGTCTGGAAGGCAAGATGAGCTTT
    TTGCTGGGAATGCGCACGTGGAAAGGCAGAATTCGGTCATAA
  • >gi|8147734|gb|AW958051.1|AW958051 EST370121 MAGE
    resequences, MAGE Homo sapiens cDNA, mRNA
    sequence
    GGAGCTGGCCTACTGCCTGCAGAAGTGCAACGTGCCCTTGTT
    TGTCCAACATGATGCTGCCCAACTGTACCTCAAACTCTGGAACCTGATTA
    AGGACCAGATCACTGATGTGCACTTGGTGGAGAGACTGCAGGCCCTGTAT
    ATGATCCGGGTGAAGGACTCCTTGATTTGCGTTGACTGTGCCATGGAGAG
    TAGCAGAAACAGCAGCATGCTCACCCTCCCACTTTCTCTTTTTGATGTGG
    ACTCAAAGCCCCTGAAGACACTGGAGGACGCCCTGCACTGCTTCTTCCAG
    CCCAGGGAGTTATCAAGCAAAAGCAAGTGCTTCTGTGAGAACTGTGGGAA
    GAAGACCCGTGGGAAACAGGTCTTGAAGCTGACCCATTTGCCCCAGACCC
    TGACAATCCACCTTATGCGATTCTCCATCAGGAATTCACAGACGAGAAAG
    ATCTGCCACTCCCTGTACTTCCCCCAGAGCTTGGATTTCAGCCAGATCCT
    TCCAATGAAGCGAGAGTCTTGTGATGCTTGAGGAGCAATCTGGAGGGCAT
    ATGAGCTTTTTGCTGTGATTGCGCACCTGGGAATGCAAAACTCCGTCATT
    ACTG
  • >gi|20393914|gb|BQ213074.1|BQ213074
    AGENCOURT_7559959 NIH_MGC_72 Homo sapiens cDNA
    clone IMAGE:6055692 5', mRNA sequence
    AGATCTGCCACTC
    CCTGTACTTCCCCCAGAGCTTGGATTTCAGCCAGATCCTTCCAATGAAGC
    GAGAGTCTTGTGATGCTGAGGAGCAGTCTGGAGGGCAGTATGAGCTTTTT
    GCTGTGATTGCGCACGTGGGAATGGCAGACTCCGGTCATTACTGTGTCTA
    CATCCGGAATGCTGTGGATGGAAAATGGTTCTGCTTCAATGACTCCAATA
    TTTGCTTGGTGTCCTGGGAAGACATCCAGTGTACCTACGGAAATCCTAAC
    TACCACTGGCAGGAAACTGCATATCTTCTGGTTTACATGAAGATGGAGTG
    CTAATGGAAATGCCCAAAACCTTCAGAGATTGACACGCTGTCATTTTCCA
    TTTCCGTTCCTGGATCTACGGAGTCTTCTAAGAGATTTTGCAATGAGGAG
    AAGCATTGTTTTCAAACTATATAACTGAGCCTTATTTATAATTAGGGATA
    TTATCAAAATATGTAACCATGAGGCCCCTCAGGTCCTGATCAGTCAGAAT
    GGATGCTTTCACCAGCAGACCCGGCCATGTGGCTGCTCGGTCCTGGGTGC
    TCGCTGCTGTGCAAGACATTAGCCCTTTAGTTATGAGCCTGTGGGAACTT
    CAGGGGTTCCCAGTGGGGAGAGCAGTGGCAGTGGGAGGCATCTGGGGGGC
    CAAGGGCAGTGGCAGGGGGTATTTCAGTATTATACCACTGCTGTGACCAG
    ACTTGTATACTGGCTGAATATCAGGGCTGGTTGTAATTTTTTCCCTTTGA
    AGAAACACCATTAATTTCCTAATGAATCCAAGTGGTTTGTAACTTGCCTA
    TTCCTTTTATTCCAGCAAAAAATTAATTGATCATCCCCTCCCCCAAAAAA
    TAGGGG
  • >gi|13728017|gb|BG206330.1|BG206330 RST25778
    Athersys RAGE Library Homo sapiens cDNA, mRNA
    sequence
    TCCTGGGAAGACATCCAGTGTACCTACGGAAATCCTAACTAC
    CACTGGCAGGAAACTGCATATCTTCTGGTTTACATGAAGATGGAGTGCTA
    ATGGAAATGCCCAAAACCTTCAGAGATTGACACGCTGTCATTTTCCATTT
    CCGTTCCTGGATCTACGGAGTCTTCTAAGAGATTTTGCAATGAGGAGAAG
    CATTGTTTTCAAACTATATAACTGAGCCTTATTTATAATTAGGGATATTA
    TCAAAATATGTAACCATGAGGCCCCTCAGGTCCTGATCAGTCAGAATGGA
    TGCTTTCACCAGCAGACCCGGCCATGTGGCTGCTCGGTCCTGGGTGCTCG
    CTGCTGTGCAAGACATTAGCCCTTTAGTTATGAGCCTGTGGGAACTTCAG
    GGGTTCCCAGTGGGGAGAGCAGTGGCAGTGGGAGGCATCTGGGGGCCAAA
    GGTCAGTGGCAGGGGGTATTTCAGTATTATACAACTGCTGTGACCAGACT
    TGTATACTGGCTGAATATCAGTGCTGTTTGTAATTTTTCACTTTGAGAAC
    CAACATTAATTCCATATGAATCAAGTGTTTTGGAACTGCTATTCATTTAT
    TCAGCAAATATTTATTGGTCATCTTTTCTCCATAAGATAGTGTGATAAAC
    ACAGCATGAATAAAGGTATTTTCCACACAGACAAGTGTTTTTTCACAAAA
    TTATTNATTTTGNTGGGGCTGTGGCGGCCGCTTCCTTTATGGGGGGGAAT
    TTAGAACCCGTTCCTGACGCGGGGGN

19
Assembling DNA seq. fragments
20
Assembling DNA seq. fragments
21
Assembling DNA seq. fragments
Assembled mRNA sequence
22
Box 2.1 Pairwise Sequence Alignment
  • The most important class of bioinformatics tools:
    pairwise alignment of DNA and protein seqs.
  •          alignment 1   alignment 2
  • Seq. 1   ACGCTGA       ACGCTGA
  • Seq. 2   A--CTGT       ACTGT--
  • Seeks alignments → high seq. identity, few
    mismatches and gaps
  • Assumption: the observed identity in the seqs. to be
    aligned is the result of either chance or a
    shared evolutionary origin
  • Identity ≠ similarity
  • Sequence identity = Homology (a risky assumption)
  • Sequence identity ≠ Homology

23
Box 2.1 Pairwise Sequence Alignment
The same true alignment can arise through different
evolutionary events. Scoring scheme: substitution
→ -1, indel → -5, match → 3.
Alignment scores shown in the figure: 9, 5, 4, 4
Figure A Common evolutionary events and their
effects on alignment
24
Box 2.1 Pairwise Sequence Alignment
  • Find the optimal score → the best guess for the
    true alignment
  • Find the optimal pairwise alignment of two seqs.
    → insert gaps into one or both of them →
    maximize the total alignment score
  • Dynamic programming (DP): Needleman and Wunsch
    (1970), Smith and Waterman (1981); this algorithm
    guarantees that we find all optimal alignments of
    two seqs. of lengths m and n
  • BLAST builds on DP, trading guaranteed optimality
    for speed
  • Prof. Waterman: http://www.usc.edu/dept/LAS/biosci/faculty/waterman.html

25
Box 2.1 Pairwise Sequence Alignment
The score for alignment of i residues of sequence
1 against j residues of sequence 2 is given by
S(i,j) = max{ S(i-1,j-1) + c(i,j), S(i-1,j) + c(i,-),
S(i,j-1) + c(-,j) }
where c(i,j) = the score for alignment of
residues i and j, which takes the value 3 for a
match or -1 for a mismatch, and c(i,-) = c(-,j) =
the penalty for aligning a residue with a gap,
which takes the value -5.
26
Box 2.1 Pairwise Sequence Alignment
  • The entry for S(1,1) is the maximum of the
    following three events
  • S(0,0) + c(A,A) = 0 + 3 = 3, where c(A,A) =
    c(1,1)
  • S(0,1) + c(A,-) = -5 - 5 = -10, where c(A,-) =
    c(1,-)
  • S(1,0) + c(-,A) = -5 - 5 = -10, where c(-,A) =
    c(-,1)
  • Similarly, one finds S(2,1) as the maximum of
    three values: (-5) - 1 = -6, 3 - 5 = -2, and
    (-10) - 5 = -15 → the best entry is the addition
    of the C indel to the A-A match, for a score of -2
    (see next page).

27
Box 2.1 Pairwise Sequence Alignment
The alignment matrix of sequences 1 and 2
S(2,1) = max{ S(1,0) + c(2,1), S(1,1) + c(2,-),
S(2,0) + c(-,1) } = max{ S(1,0) + c(C,A), S(1,1) +
c(C,-), S(2,0) + c(-,A) } = max{ -5 - 1, 3 - 5,
-10 - 5 } = -2
28
Box 2.1 Pairwise Sequence Alignment
Traceback → determine the actual alignment. Start from
the top right-hand corner → the (7,5) cell.
For example, the 1 in the (7,5) cell could only be
reached by the addition of the mismatch A-T.
ACGCTGA        ACGCTGA
A--CTGT   or   AC--TGT
4 matches, 1 mismatch, 2 indels. The ambiguity has to
do with which C in seq. 1 aligns with the C in
seq. 2.
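The matrix fill and traceback from Box 2.1 can be sketched in a few lines. The code uses the scoring scheme from the slides (match 3, mismatch -1, gap -5) and the two example seqs.; the function names and the tie-breaking order in the traceback (diagonal first, then a gap in seq. 2, then a gap in seq. 1) are choices made for this sketch, so it returns only one of the two equally good alignments.

```python
MATCH, MISMATCH, GAP = 3, -1, -5


def fill_matrix(seq1: str, seq2: str) -> list[list[int]]:
    """S[i][j] = best score for aligning the first i residues of seq1
    with the first j residues of seq2 (global alignment)."""
    m, n = len(seq1), len(seq2)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        S[i][0] = i * GAP
    for j in range(1, n + 1):
        S[0][j] = j * GAP
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            c = MATCH if seq1[i - 1] == seq2[j - 1] else MISMATCH
            S[i][j] = max(S[i - 1][j - 1] + c,   # align residue i with residue j
                          S[i - 1][j] + GAP,     # residue i of seq1 against a gap
                          S[i][j - 1] + GAP)     # residue j of seq2 against a gap
    return S


def traceback(seq1: str, seq2: str, S: list[list[int]]) -> tuple[str, str]:
    """Walk back from the (m, n) cell to recover one optimal alignment."""
    i, j = len(seq1), len(seq2)
    top, bottom = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            c = MATCH if seq1[i - 1] == seq2[j - 1] else MISMATCH
            if S[i][j] == S[i - 1][j - 1] + c:
                top.append(seq1[i - 1])
                bottom.append(seq2[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and S[i][j] == S[i - 1][j] + GAP:
            top.append(seq1[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(seq2[j - 1])
            j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom))


seq1, seq2 = "ACGCTGA", "ACTGT"
S = fill_matrix(seq1, seq2)
aligned1, aligned2 = traceback(seq1, seq2, S)
print(S[1][1], S[2][1], S[7][5])
print(aligned1)
print(aligned2)
```

The entries S(1,1) = 3, S(2,1) = -2 and S(7,5) = 1 agree with the values worked out on the slides.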
29
Box 2.1 Pairwise Sequence Alignment
  • Parameter settings - gap penalties
  • Default settings are the easiest to use, but they
    do not necessarily yield the correct alignment
  • Constant penalty → independent of the length of
    the gap, A
  • Proportional penalty → penalty is proportional to
    the length L of the gap, BL (that is what we used
    in this lecture)
  • Affine gap penalty → gap-opening penalty +
    gap-extension penalty, A + BL
  • There is no rule for predicting the penalty that
    best suits the alignment
  • Optimal penalties vary from seq. to seq. → it is
    a matter of trial and error
  • Usually A > B, because of the cost of opening a gap
    (usually A/B is about 10)
  • Hint: (1) when comparing distantly related seqs., a
    high A and a very low B often give the best results
    → gaps are penalized more for their existence than
    for their length; (2) when comparing closely related
    seqs., penalize both opening and extension
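The three gap-penalty models differ only in how the cost grows with the gap length L. A small sketch, using the slide's symbols A (opening) and B (per-residue extension) and returning the cost as a positive number; the values A = 10 and B = 1 are illustrative, chosen to match the A/B ratio mentioned above.

```python
def constant_penalty(length: int, A: float) -> float:
    """Same cost for any gap, regardless of its length."""
    return A if length > 0 else 0


def proportional_penalty(length: int, B: float) -> float:
    """Cost grows linearly with gap length: B * L."""
    return B * length


def affine_penalty(length: int, A: float, B: float) -> float:
    """Opening cost plus a per-residue extension cost: A + B * L."""
    return A + B * length if length > 0 else 0


for L in (1, 3, 10):
    print(L, constant_penalty(L, 10), proportional_penalty(L, 5), affine_penalty(L, 10, 1))
```

With the affine model, one gap of length 10 costs 20 while ten separate gaps of length 1 cost 110, which is why a high A and a low B favor a few long gaps.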

30
Exercise 2.2 Computing an optimal sequence
alignment
  • Two scoring schemes
  • (1) Gap penalty -5, mismatch -1, match 3
  • (2) Gap penalty -1, mismatch -1, match 3
  • (1) First alignment score = 5×3 + 2(-1) = 13
  • Second/Third alignment score = 6×3 + 2(-5) =
    8
  • (2) First alignment score = 5×3 + 2(-1) = 13
  • Second/Third alignment score = 6×3 + 2(-1) =
    16
  • A more serious problem: identifying the wrong
    alignment
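The arithmetic above can be checked directly. The counts (5 matches and 2 mismatches for the first alignment, 6 matches and 2 gap positions for the second and third) are read off the scores on the slide; the function name is illustrative.

```python
def alignment_score(matches: int, mismatches: int, gaps: int,
                    match: int = 3, mismatch: int = -1, gap: int = -5) -> int:
    """Total score of an alignment from its column counts."""
    return matches * match + mismatches * mismatch + gaps * gap


for gap in (-5, -1):
    first = alignment_score(5, 2, 0, gap=gap)
    second = alignment_score(6, 0, 2, gap=gap)
    print(f"gap {gap}: first = {first}, second/third = {second}")
```

With a gap penalty of -5 the ungapped first alignment wins (13 against 8); with -1 the gapped alignments win (16 against 13), so the choice of penalty decides which alignment is reported as optimal.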

31
Exercise 2.2 Computing an optimal sequence
alignment
Gap penalty -5
Gap penalty -1
32
Emerging Sequencing Methods
  • Costs of genome sequencing
  • Mid-2000: $30-50 million to sequence a
    mammalian genome
  • Target: $1,000 per human genome by the year 2010
  • J. Craig Venter Foundation: $500,000 award for
    the first person to achieve this goal
  • New technologies
  • Sequencing by hybridization (SBH): detects
    whether or not an exact match is present in a
    sample of DNA
  • Mass spectrometric techniques: ionized
    fragments, time of flight
  • Nanopore sequencing strategies: ultrafast and
    relatively inexpensive sequencing of long DNA
    fragments
  • Single-molecule approaches: Solexa, Visigen and
    Genovoxx
  • Single-molecule polony sequencing

33
Figure 2.6 Single-molecule polony sequencing
Emerging Sequencing Methods
A dilute solution of DNA is plated onto a glass
microscope slide. In situ PCR produces thousands
of tiny colonies of DNA (polonies, i.e. PCR colonies),
into which single dye-labeled dNTPs are incorporated.
The slide is read after each cycle
of incorporation of a new base, allowing short
seqs. to be determined. Each numbered polony
produces a short 20-25 nucleotide seq., as
shown. These can then be assembled
computationally into a contiguous seq.
34
Figure 2.7 (Part 1) Hierarchical versus shotgun
sequencing
Genome Sequencing
  • Whole genome seqs. are assembled from
    ~10^5 fragments, each typically between
    500 and 1000 bp in length.
  • Two general approaches for fragmentation
    and assembly: (1) hierarchical seq., (2) shotgun
    seq.
  • For a historical overview, see
    http://www.sciencemag.org/feature/plus/sfg/human/timeline1.shtml
  • Hierarchical seq.
  • First develop a low-resolution physical
    map, so that the seq. is obtained in
    large, ordered pieces.
  • Shotgun seq.
  • Break the genome into small fragments and use
    computer algorithms to assemble them, see Figure
    2.7
  • Most new genome projects adopt the shotgun
    approach.

35
Genome Sequencing hierarchical sequencing
  • Top-down, map-based or clone-by-clone strategy
  • Late 1980s
  • Genome → break into small fragments
  • The relative locations of the fragments are known
    BEFORE sequencing
  • Advantages
  • It fostered (helped develop) assembly of
    high-resolution physical and genetic maps
  • Allows groups around the globe to work on the
    same project
  • Technology for cloning large fragments of genomes
    progressed rapidly throughout the 1990s, for
    genomes such as E. coli, S. cerevisiae, C. elegans
    and A. thaliana.
  • Top-down seq. → clone seqs. as manageable units of
    fragments (50-200 kb in length)
  • Cloning vectors: BAC (300 kb), PAC (100 kb),
    phage-derived cosmids

36
Figure 2.7 (Part 2) Shotgun sequencing
Genome Sequencing Shotgun sequencing
In the shotgun approach, no attempt is made to
order the clones in advance. Instead, the whole
genome is assembled using computer algorithms
that order contigs based on their overlapping
sequences.
37
Figure 2.8 Cloning vectors used in genome
sequencing
Cloning vectors used in genome sequencing
38
Genome Sequencing hierarchical sequencing
  • DNA libraries
  • By restriction enzyme (RE) or sonication
  • Fragments are ligated into a multiple cloning
    site (mcs) in the vector
  • Aim for 5- to 10-fold redundancy → the library
    holds 5 to 10 times the genome length
  • Each clone will have different ends → possible to
    select a scaffold of clones that forms a
    contiguous seq. coverage: a tiling path
  • By aligning the regions of overlap (Fig. 2.9)
  • The tiling path can be assembled using a
    combination of 3 methods (1) hybridization, (2)
    fingerprinting, and (3) end-sequencing

39
Figure 2.9 Hierarchical assembly of a
sequence-contig scaffold (supercontig)
Genome Sequencing hierarchical sequencing
  • A minimal tiling path through a library of
    aligned BAC clones that ensures complete coverage
    of the chromosome is chosen.
  • After sequencing independent shotgun libraries
    for each BAC, small gaps in the sequenced clone
    contigs remain.
  • These are closed as far as possible by merging
    the two BAC sequences, as well as by the addition
    of mate-pair information (yellow) and cDNA
    structural information (red), which establishes
    the orientation and distance between cloned
    segments.

40
Genome Sequencing hierarchical sequencing
  • Hybridization
  • All of the clones in a library that carry a
    particular seq. can be identified rapidly by
    hybridizing a small radioactively or chemically
    labeled probe containing the seq. to a filter on
    which is printed an array of 10,000s of clones
    (Fig. 2.10A)
  • Fingerprinting
  • Study the restriction enzyme (RE) patterns
  • One way to assemble contigs of large-insert clones
    is to compare and align them according to their RE
    patterns
  • RE with a 6-bp site → cuts every 4^6 = 2^12 ≈ 4000
    bp on average
  • For a BAC of 100 kb → 100 kb / 4 kb ≈ 20-30
    fragments
  • These fragments can be separated by
    electrophoresis → fingerprint profile → BAC
    alignment by gel → software alignment →
    overlapping → contigs → assembly of Mb-length
    contigs
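The fragment-count estimate for fingerprinting is simple arithmetic. The sketch below assumes a random sequence with equal base frequencies, so a given 6-bp recognition site is expected once every 4^6 bp; real genomes deviate from this, which is one reason the slide quotes a range of 20-30 fragments.

```python
site_length = 6            # recognition site of a "6-cutter" restriction enzyme
bac_insert_bp = 100_000    # BAC insert size used on the slide

# Expected distance between cut sites under the equal-base-frequency assumption.
expected_spacing_bp = 4 ** site_length

# Expected number of restriction fragments per BAC insert.
expected_fragments = bac_insert_bp / expected_spacing_bp

print(expected_spacing_bp)            # 4096
print(round(expected_fragments, 1))   # 24.4
```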

41
Figure 2.10 Aligning BAC clones by hybridization
and fingerprinting
Genome Sequencing hierarchical sequencing
  • (A) A macroarray of BAC clones is probed
    with a short, radioactive fragment to
    identify all BACs that carry a specific
    fragment.
  • (B) These clones are digested with an RE,
    end-labeled, and separated by gel electrophoresis.
    Software converts the bands to a virtual
    profile, shown hypothetically for a small
    portion of four bands (highlighted box in
    part B). Shared bands (red or blue) imply
    that the two clones share the same seq.
    Green indicates the vector band common to
    all clones.
  • The fingerprint profile is then converted into
    a BAC alignment. In this example, clone 2
    does not share any bands with the others and
    so is placed into a separate BAC contig, while the
    other three clones form a tiling path.

42
Genome Sequencing hierarchical sequencing
  • End-sequencing
  • Fill in the gaps after fingerprinting. How?
  • Sequence both ends of the collection of BAC
    clones
  • Once a critical threshold of seqs. has been
    reached → overlaps appear
  • For example, along a 10 Mbp genome, end seqs. of
    10,000 BAC clones → provide a seq. tag every 5 kb
    (for a 5-fold coverage)
  • Along a 10 Mbp genome → 10 Mbp / 10,000 BACs = 1
    kbp/BAC
  • Five-fold → 10 Mb / 2000 BACs = 5 kb (the seq. tag
    distance)
  • Given this tag density, it is possible to close
    gaps < 50 kb
  • Once the tiling path is chosen → shotgun the BAC
    clones into small fragments
  • Subcloning: use M13 phagemid (1 kb; exists as
    dsDNA and ssDNA)
  • or clone 2-3 kb fragments into a plasmid vector

43
Genome Sequencing Shotgun sequencing
  • Use computer algorithms to assemble the seqs.
    (100,000)
  • About 5- to 10-fold redundancy for each fragment
  • Library: from a single whole genome
  • After MSA → screen out repetitive seqs., overlap
    reads of the same seq. → generate
    unitigs and scaffolds → > 90% of the seqs. are
    assembled
  • Finishing phase: closing gaps, cleaning up
    ambiguities → can take as much time as
    the shotgun phase
  • Users are asked to trust the assemblies
  • Celera Genomics used the following software to
    assemble the seqs.
  • Screener: masks (does not remove) seqs. that
    contain repetitive DNA
    (such as microsatellites, LINE, Alu repeats,
    retrotransposons and ribosomal DNA)
  • Overlapper: compares every unscreened read
    against every other unscreened read,
    searching for overlaps of a predetermined length
    and identity.
  • Parallel processing on 40 supercomputers, each
    with 4 GB RAM, allowed the 27 M
    screened human seq. reads to be overlapped in <
    5 days!
  • Unitigger: resolves repeat-induced overlaps of a
    seq. (see Figure 2.11).
  • Scaffolder: uses mate-pair information to link
    U-unitigs into scaffold contigs
44
Genome Sequencing Shotgun sequencing
  • Figure 2.11
  • (A) Seq. alignment between two or more shotgun
    clones can arise between unique seqs. (left) or
    repetitive seqs. (right).
  • (B) The Overlapper aligns unitigs, which are
    identified as unique seq. alignments (U-unitigs)
    or overcollapsed repeats (blue).
  • Two contigs can be aligned and
    oriented by using mate-pair seq.
    information from the ends of longer (10- or
    50-kb) clones, as shown at the bottom, while
    mate-pairs from 2-kb fragments allow assembly of
    scaffolds despite the presence of simple repeats
    such as microsatellites (blue) that are masked
    before performing alignments.

Figure 2.11 U-unitigs and repeat resolution
45
Genome Sequencing Shotgun sequencing
Figure 2.12 shows the estimated coverage of the
fly and human whole genomes after initial
assembly. In both cases, 84% or more of the
genomes was covered by scaffolds at least 100 kb
in length, while most scaffolds were in the Mb
range. Increasing seq. coverage from 5x to 10x gives
a 10% increase in the proportion of scaffolds of
lengths up to 1 Mb. The plot shows the percentage of
scaffolds that have a length greater than that indicated
for the fly 10x, human 8x (CSA) and human 5x
(whole genome assembly, WGA) seqs. generated by
Celera. The fly and CSA assemblies include
shredded seqs. generated from BAC clones
by public genome sequencing efforts.
Figure 2.12 Proportion of fly and human genomes
in large scaffolds
46
  • NCTS: http://math.cts.nthu.edu.tw/Mathematics/conference-PT2005.html
  • UCSD:
  • http://research.calit2.net/recomb-workshop05/