Gibson Second Edition

Description:

Genome Sequencing and Annotation (Part 1) – PowerPoint PPT presentation

Slides: 47

Provided by: edut1550

Transcript and Presenter's Notes

1
Genome Sequencing and Annotation (Part 1)
2

Objective of most genome projects
Sequencing DNA and mRNA
Identifying genes and characterizing gene features
This chapter
How blocks of DNA seqs. are obtained
How these blocks are assembled into contigs, then
genomes
Bioinformatics: how to do seq. alignment, e.g., of
cDNA/EST and genome seqs.
Annotation of ORFs
Other features of genes: repetitive elements,
variable distribution of GC content, evolutionarily
conserved elements
Gene annotation by cross-species annotation

3
2.1 (Part 2) The principle of dideoxy (Sanger)
sequencing
Automated DNA sequencing
In 1977, F. Sanger developed the chain-termination
method (Sanger sequencing). Sanger won his second
Nobel Prize for inventing this process
4
Automated DNA sequencing

Most current sequencing projects use the chain
termination method
Also known as Sanger sequencing, after its
inventor
Based on action of DNA polymerase
Adds nucleotides to complementary strand
Requires template DNA and primer

5
Chain-termination sequencing

Dideoxynucleotides (ddA, ddT, ddC or ddG) stop
synthesis
Chain terminators (DNA polymerase cannot add
another nucleotide)
Included in limiting amounts, so that some chains
terminate at each position where the base appears
in the template
Use four reactions
One for each base A,C,G, and T

Template
3' ATCGGTGCATAGCTTGT 5'
Sequence reaction products
5' TAGCCACGTATCGAACA 3'
5' TAGCCACGTATCGAA 3'
5' TAGCCACGTATCGA 3'
5' TAGCCACGTA 3'
5' TAGCCA 3'
5' TA 3'
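The products listed on this slide all end in A, so they look like the output of the ddA reaction. A minimal sketch of how such a product set can be derived from the template follows. It ignores the primer, and the function names are illustrative, not from the slides.

```python
# Watson-Crick pairing used to build the new strand from the template.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def synthesized_strand(template_3to5):
    """New strand (written 5'->3') built on a template written 3'->5'."""
    return "".join(COMPLEMENT[base] for base in template_3to5)

def termination_products(template_3to5, dd_base):
    """Every chain that ends where dd_base was incorporated as a ddNTP."""
    strand = synthesized_strand(template_3to5)
    return [strand[: i + 1] for i, base in enumerate(strand) if base == dd_base]

template = "ATCGGTGCATAGCTTGT"  # the slide's template, read 3'->5'
products_ddA = termination_products(template, "A")
for product in reversed(products_ddA):  # longest first, as on the slide
    print("5'", product, "3'")
```

Calling `termination_products(template, "C")` (or `"G"`, `"T"`) gives the chains expected in the other three reactions.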
6
Sequence detection

To detect products of sequencing reaction
Include labeled nucleotides
Formerly, radioactive labels (33P or 35S) were
used
Now fluorescent labels
Use different fluorescent tag for each nucleotide
Can run all four reactions in a single gel lane
or capillary tube

TAGCCACGTATCGAA
TAGCCACGTATC
TAGCCACG
TAGCCACGT
7
Sequence separation

Terminated chains need to be separated
Requires one-base-pair resolution
Must distinguish chains of X and X+1 base
pairs
Gel electrophoresis
Very thin gel
High voltage applied
Works with radioactive or fluorescent labels
Negative pole at the top

[Figure: gel lanes labeled C, A, G, T]
8
Sequence reading of radioactively labeled
reactions

The final step of sequencing is to read the
sequence
Radioactive labeled reactions
Gel dried
Placed on X-ray film
Film developed, the position of each band becomes
visible
Sequence read from bottom up (the positive pole)
Each of the four lanes giving the position of a
different base A, T, C or G

9
Sequence reading of fluorescently labeled
reactions

Fluorescently labeled reactions are scanned by a
laser as they pass a fixed point
Color picked up by detector
Output sent directly to computer
The read out is given both in terms of bases and
the intensity of each color, so that ambiguous
readings are easily identified

10
Summary of chain termination sequencing
A primer is extended by DNA polymerase based on
the sequence present in the template strand. Each
chain is terminated by a ddNTP that is
complementary to the template strand. The four
reactions are separated on a gel that can resolve
one-base differences. The seq. is then read from
the bottom of the gel to the top.
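The read-out step can be sketched in a few lines. This is an illustration only: it assumes an idealized gel where each lane is represented by the lengths of its terminated chains, and `read_gel` is a hypothetical helper name.

```python
def read_gel(lanes):
    """lanes maps a base ('A', 'C', 'G', 'T') to the lengths of the chains
    seen in that lane. Shorter chains migrate farther, so reading from the
    bottom of the gel means visiting bands in order of increasing length."""
    bands = sorted(
        (length, base) for base, lengths in lanes.items() for length in lengths
    )
    return "".join(base for _, base in bands)

# Build idealized lanes from a known strand, then read the strand back.
strand = "TAGCCACGTATCGAACA"
lanes = {b: [i + 1 for i, x in enumerate(strand) if x == b] for b in "ACGT"}
print(read_gel(lanes))
```

Because every chain length occurs in exactly one lane, sorting the bands by length recovers the synthesized strand base by base.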
11
High-Throughput Sequencing

The new techniques and equipment include
(1) Four-color fluorescent dyes have replaced the
radioactive label
(2) Rather than stopping the electrophoresis at a
particular time, the products are scanned for
laser-induced fluorescence just before they run
off the end of the electrophoresis medium
(3) Improvements in the chemistry of template
purification and the sequencing reaction
(4) Slab gel electrophoresis gave way to
capillary electrophoresis with the introduction
in 1999 of Applied Biosystems ABI Prism 3700
automated sequencers, which in turn were updated
with ABI Prism 3730 DNA analyzers in 2003
(these deliver extremely high-quality, long reads
and save time and money)

ABI Prism 3730 DNA analyzers
12
Reading sequence traces

Base-calling: the reading of raw sequence traces
Now routinely performed using automated software
that reads bases, aligns similar seqs. and
supports editing
Program: phred, http://www.phrap.org
The program assigns probability scores to the
accuracy of each base call as the trace is read
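The probability scores are usually reported on the phred quality scale, Q = -10 log10(P), where P is the estimated probability that the base call is wrong. The sketch below only converts between the two; the function names are illustrative and are not part of the phred program.

```python
import math

def phred_quality(p_error):
    """Quality score Q for an estimated base-call error probability."""
    return -10 * math.log10(p_error)

def error_probability(q):
    """Inverse conversion: error probability implied by quality score Q."""
    return 10 ** (-q / 10)

for p in (0.1, 0.01, 0.001):
    print(p, round(phred_quality(p)))
```

On this scale Q20 corresponds to one expected error in 100 calls and Q30 to one in 1,000.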

13
2.3 Automated sequence chromatograms

This seq. shows the noisiness of the first 30 bp of
a run.
The middle two rows show a segment of two seqs.
that are polymorphic for both SNPs and an indel.
A decline in seq. quality typically occurs after
about 800 bp.

Ex. 2.1 Reading a sequence trace

The base is labeled N because of poor seq. quality.
Where two peaks of the same height are observed at
the same location, the site is heterozygous for a
C/T SNP.
15
Figure 2.5 An aligned-reads window in consed
Contig Assembly
16
Assembling DNA seq. fragments

NCBI dbEST database: http://www.ncbi.nlm.nih.gov/Database/
View the EST statistics
FTP EST files

17
Assembling DNA seq. fragments

IFOM assembler:
http://bio.ifom-firc.it/ASSEMBLY/assemble.html
Multiple EST seqs. → contig
The max. number of seqs. you can enter is 10,000
Use gi (15744427, 19124086, 8147732, 8147734,
20393914, 13728017)
Lengths (850, 1062, 634, 596, 869, 768) bp
Result: a single contig consensus seq., which can
be used for a similarity search against a db
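The idea of merging overlapping fragments into a contig can be sketched with a toy greedy assembler. This is not the IFOM assembler's algorithm. It assumes error-free reads on the same strand and exact suffix/prefix overlaps, whereas real ESTs contain sequencing errors and need tolerant alignment. The reads below are made up for illustration.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b
    (0 if shorter than min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the longest exact overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no remaining overlaps: several contigs are left
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return reads

toy_reads = ["CACGTATC", "TAGCCACG", "TATCGAACA"]  # hypothetical fragments
contigs = greedy_assemble(toy_reads)
print(contigs)
```

With these three reads the overlaps CACG and TATC chain the fragments into a single contig.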

18
Assembling DNA seq. fragments: 6 GI fragments

>gi|15744427|gb|BI752849.1|BI752849 603022060F1 NIH_MGC_114 Homo sapiens cDNA clone IMAGE5192510 5', mRNA sequence
CGGGGTGCTGCGAGCGCGGGGCCAGACCAAGGC
GGGCCCGGAGCGGAACTTCGGTCCCAGCTCGGTCCCCGGCTCAGTCCCGA
CGTGGAACTCAGCAGCGGAGGCTGGACGCTTGCATGGCGCTTGAGAGATT
CCATCGTGCCTGGCTCACATAAGCGCTTCCTGGAAGTGAAGTCGTGCTGT
CCTGAACGCGGGCCAGGCAGCTGCGGCCTGGGGGTTTTGGAGTGATCACG
AATGAGCAAGGCGTTTGGGCTCCTGAGGCAAATCTGTCAGTCCATCCTGG
CTGAGTCCTCGCAGTCCCCGGCAGATCTTGAAGAAAAGAAGGAAGAAGAC
AGCAACATGAAGAGAGAGCAGCCCAGAGAGCGTCCCAGGGCCTGGGACTA
CCCTCATGGCCTGGTTGGTTTACACAACATTGGACAGACCTGCTGCCTTA
ACTCCTTGATTCAGGTGTTCGTAATGAATGTGGACTTCACCAGGATATTG
AAGAGGATCACGGTGCCCAGGGGAGCTGACGAGCAGAGGAGAAGCGTCCC
TTTCCAGATGCTTCTGCTGCTGGAGAAGATGCAGGACAGCCGGCAGAAAG
CAGTGCGGCCCCTGGAGCTGGCTACTGCCTGCAGAAGTGCAACGTGCCCT
TGTTTGTCCAACATGATGCTGCCAACTGTACCTCAAACTCTGGAACCTGA
TTAAGGACCAGATCACTGATGTGCACTTGGTGGAGAGACTGCAGGCCCTG
TATATGATCCGGGTGAAGGACTCCTTGATATGCGTTGACTGTGCCATGGG
AGAGTAGCAGAAAACAGCAGCATGCTCAACCTCCCACTTTCTCTATTGGA
TGTGGACTCAAAGCCCT
>gi|19124086|gb|BM807263.1|BM807263 AGENCOURT_6574903 NIH_MGC_124 Homo sapiens cDNA clone IMAGE5732238 5', mRNA sequence
GTCCGGAATTCCC
GGGATCTCAGCAGCGGAGGCTGGACGCTTGCATGGCGCTTGAGAGATTCC
ATCGTGCCTGGCTCACATAAGCGCTTCCTGGAAGTGAAGTCGTGCTGTCC
TGAACGCGGGCCAGGCAGCTGCGGCCTGGGGGTTTTGGAGTGATCACGAA
TGAGCAAGGCGTTTGGGCTCCTGAGGCAAATCTGTCAGTCCATCCTGGCT
GAGTCCTCGCAGTCCCCGGCAGATCTTGAAGAAAAGAAGGAAGAAGACAG
CAACATGAAGAGAGAGCAGCCCAGAGAGCGTCCCAGGGCCTGGGACTACC
CTCATGGCCTGGTTGGTTTACACAACATTGGACAGACCTGCTGCCTTAAC
TCCTTGATTCAGGTGTTCGTAATGAATGTGGACTTCACCAGGATATTGAA
GAGGATCACGGTGCCCAGGGGAGCTGACGAGCAGAGGAGAAGCGTCCCTT
TCCAGATGCTTCTGCTGCTGGAGAAGATGCAGGACAGCCGGCAGAAAGCA
GTGCGGCCCCTGGAGCTGGCCTACTGCCTGCAGAAGTGCAACGTGCCCTT
GTTTGTCCAACATGATGCTGCCCAACTGTACCTCAAACTCTGGAACCTGA
TTAAGGACCAGATCACTGATGTGCACTTGGTGGAGAGACTGCAGGCCCTG
TATACGATCCGGGTGAAGGACTCCTTGATTTGCGTTGACTGTGCCATGGA
GAGTAGCAGAAACAGCAGCATGCTCACCCTCCCACTTTCTCTTTTTGATG
TGGACTCAAAGCCCCTGGAAGACACTGGAGGACGCCCTGCACTGCTTCTT
CCAGCCCAGGAGTTATCAAGCAAAAGCAAGTGCTTCTGTGAGAACTGTGG
GAAGAAGACCCGCGGGGAACAGGGTCCTGAAACCTGACCATTTTGCCCCA
GACCTTGACCAATCCACCTCATGGCGATTCTCCCTCCAGGAATTCCCCGA
CCGAGAAAAAATTGGCCACTTCCCCGGAATTTCCCCCCAAAAACTTGGAA
TTTCACCCAAAACCTTTCCCATGTAAACCCGGAAACCCTGGGGAAGGCT
>gi|8147732|gb|AW958049.1|AW958049 EST370119 MAGE resequences, MAGE Homo sapiens cDNA, mRNA sequence
GAACTAGTGGATCCCCCGGGCTGCAGGAATTCGGCACGAGTG
GAGCTGGCCTACTGCCTGCAGAAGTGCAACGTGCCCTTGTTTGTCCAACA
TGATGCTGCCCAACTGTACCTCAAACTCTGGAACCTGATTAAGGACCAGA
TCACTGATGTGCACTTGGTGGAGAGACTGCAGGCCCTGTATATGATCCGG
GTGAAGGACTCCTTGATTTGCGTTGACTGTGCCATGGAGAGTAGCAGAAA
CAGCAGCATGCTCACCCTCCCACTTTCTCTTTTTGATGTGGACTCAAAGC
CCCTGAAGACACTGGAGGACGCCCTGCACTGCTTCTTCCAGCCCAGGGAG
TTATCAAGCAAAAGCAAGTGCTTCTGTGAGAACTGTGGGAAGAAGACCCG
TGGGAAACAGGTCTTGAAGCTGACCCATTTGCCCCAGACCCTGACAATCC
ACCTCATGCGATTCTTCATCAGGAATTCACAGACGAGAAAGATCTGCCAC
TCCCTGTACTTCCCCCAGAGCTTGGATTTCAGCCAGAACCTTCCAATGAA
GCGAGAATCTTGTGAAGCTGAAGAACAGTCTGGAAGGCAAGATGAGCTTT
TTGCTGGGAATGCGCACGTGGAAAGGCAGAATTCGGTCATAA
>gi|8147734|gb|AW958051.1|AW958051 EST370121 MAGE resequences, MAGE Homo sapiens cDNA, mRNA sequence
GGAGCTGGCCTACTGCCTGCAGAAGTGCAACGTGCCCTTGTT
TGTCCAACATGATGCTGCCCAACTGTACCTCAAACTCTGGAACCTGATTA
AGGACCAGATCACTGATGTGCACTTGGTGGAGAGACTGCAGGCCCTGTAT
ATGATCCGGGTGAAGGACTCCTTGATTTGCGTTGACTGTGCCATGGAGAG
TAGCAGAAACAGCAGCATGCTCACCCTCCCACTTTCTCTTTTTGATGTGG
ACTCAAAGCCCCTGAAGACACTGGAGGACGCCCTGCACTGCTTCTTCCAG
CCCAGGGAGTTATCAAGCAAAAGCAAGTGCTTCTGTGAGAACTGTGGGAA
GAAGACCCGTGGGAAACAGGTCTTGAAGCTGACCCATTTGCCCCAGACCC
TGACAATCCACCTTATGCGATTCTCCATCAGGAATTCACAGACGAGAAAG
ATCTGCCACTCCCTGTACTTCCCCCAGAGCTTGGATTTCAGCCAGATCCT
TCCAATGAAGCGAGAGTCTTGTGATGCTTGAGGAGCAATCTGGAGGGCAT
ATGAGCTTTTTGCTGTGATTGCGCACCTGGGAATGCAAAACTCCGTCATT
ACTG
>gi|20393914|gb|BQ213074.1|BQ213074 AGENCOURT_7559959 NIH_MGC_72 Homo sapiens cDNA clone IMAGE6055692 5', mRNA sequence
AGATCTGCCACTC
CCTGTACTTCCCCCAGAGCTTGGATTTCAGCCAGATCCTTCCAATGAAGC
GAGAGTCTTGTGATGCTGAGGAGCAGTCTGGAGGGCAGTATGAGCTTTTT
GCTGTGATTGCGCACGTGGGAATGGCAGACTCCGGTCATTACTGTGTCTA
CATCCGGAATGCTGTGGATGGAAAATGGTTCTGCTTCAATGACTCCAATA
TTTGCTTGGTGTCCTGGGAAGACATCCAGTGTACCTACGGAAATCCTAAC
TACCACTGGCAGGAAACTGCATATCTTCTGGTTTACATGAAGATGGAGTG
CTAATGGAAATGCCCAAAACCTTCAGAGATTGACACGCTGTCATTTTCCA
TTTCCGTTCCTGGATCTACGGAGTCTTCTAAGAGATTTTGCAATGAGGAG
AAGCATTGTTTTCAAACTATATAACTGAGCCTTATTTATAATTAGGGATA
TTATCAAAATATGTAACCATGAGGCCCCTCAGGTCCTGATCAGTCAGAAT
GGATGCTTTCACCAGCAGACCCGGCCATGTGGCTGCTCGGTCCTGGGTGC
TCGCTGCTGTGCAAGACATTAGCCCTTTAGTTATGAGCCTGTGGGAACTT
CAGGGGTTCCCAGTGGGGAGAGCAGTGGCAGTGGGAGGCATCTGGGGGGC
CAAGGGCAGTGGCAGGGGGTATTTCAGTATTATACCACTGCTGTGACCAG
ACTTGTATACTGGCTGAATATCAGGGCTGGTTGTAATTTTTTCCCTTTGA
AGAAACACCATTAATTTCCTAATGAATCCAAGTGGTTTGTAACTTGCCTA
TTCCTTTTATTCCAGCAAAAAATTAATTGATCATCCCCTCCCCCAAAAAA
TAGGGG
>gi|13728017|gb|BG206330.1|BG206330 RST25778 Athersys RAGE Library Homo sapiens cDNA, mRNA sequence
TCCTGGGAAGACATCCAGTGTACCTACGGAAATCCTAACTAC
CACTGGCAGGAAACTGCATATCTTCTGGTTTACATGAAGATGGAGTGCTA
ATGGAAATGCCCAAAACCTTCAGAGATTGACACGCTGTCATTTTCCATTT
CCGTTCCTGGATCTACGGAGTCTTCTAAGAGATTTTGCAATGAGGAGAAG
CATTGTTTTCAAACTATATAACTGAGCCTTATTTATAATTAGGGATATTA
TCAAAATATGTAACCATGAGGCCCCTCAGGTCCTGATCAGTCAGAATGGA
TGCTTTCACCAGCAGACCCGGCCATGTGGCTGCTCGGTCCTGGGTGCTCG
CTGCTGTGCAAGACATTAGCCCTTTAGTTATGAGCCTGTGGGAACTTCAG
GGGTTCCCAGTGGGGAGAGCAGTGGCAGTGGGAGGCATCTGGGGGCCAAA
GGTCAGTGGCAGGGGGTATTTCAGTATTATACAACTGCTGTGACCAGACT
TGTATACTGGCTGAATATCAGTGCTGTTTGTAATTTTTCACTTTGAGAAC
CAACATTAATTCCATATGAATCAAGTGTTTTGGAACTGCTATTCATTTAT
TCAGCAAATATTTATTGGTCATCTTTTCTCCATAAGATAGTGTGATAAAC
ACAGCATGAATAAAGGTATTTTCCACACAGACAAGTGTTTTTTCACAAAA
TTATTNATTTTGNTGGGGCTGTGGCGGCCGCTTCCTTTATGGGGGGGAAT
TTAGAACCCGTTCCTGACGCGGGGGN

19
Assembling DNA seq. fragments
20
Assembling DNA seq. fragments
21
Assembling DNA seq. fragments
Assembled mRNA sequence
22
Box 2.1 Pairwise Sequence Alignment

The most important class of bioinformatics tools:
pairwise alignment of DNA and protein seqs.
        alignment 1   alignment 2
Seq. 1  ACGCTGA       ACGCTGA
Seq. 2  A--CTGT       ACTGT--
Seeks alignments → high seq. identity, few
mismatches and gaps
Assumption: the observed identity in the seqs. to be
aligned is the result of either random chance or a
shared evolutionary origin
Identity ≠ similarity
Sequence identity = homology (a risky assumption)
Sequence identity ≠ homology

23
Box 2.1 Pairwise Sequence Alignment
The same true alignment can arise through different
evolutionary events. Scoring scheme: substitution
→ -1, indel → -5, match → +3
[Figure A panel scores: 9, 5, 4, 4]
Figure A Common evolutionary events and their
effects on alignment
24
Box 2.1 Pairwise Sequence Alignment

Find the optimal score → the best guess for the
true alignment
Find the optimal pairwise alignment of two seqs.
→ insert gaps into one or both of them →
maximize the total alignment score
Dynamic programming (DP): Needleman and Wunsch
(1970), Smith and Waterman (1981); this algorithm
guarantees that we find all optimal alignments of
two seqs. of lengths m and n
BLAST builds on DP, with heuristics that improve speed
Prof. Waterman: http://www.usc.edu/dept/LAS/biosci/faculty/waterman.html

25
Box 2.1 Pairwise Sequence Alignment
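The score for alignment of i residues of sequence
1 against j residues of sequence 2 is given by
$$S(i,j) = \max \{\, S(i-1,j-1) + c(i,j),\; S(i-1,j) + c(i,-),\; S(i,j-1) + c(-,j) \,\}$$
with S(0,0) = 0, S(i,0) = -5i and S(0,j) = -5j,
where c(i,j) is the score for alignment of
residues i and j and takes the value 3 for a
match or -1 for a mismatch, and c(i,-) and c(-,j)
are the penalties for aligning a residue with a
gap, which take the value -5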
26
Box 2.1 Pairwise Sequence Alignment

The entry for S(1,1) is the maximum of the
following three events
S(0,0) + c(A,A) = 0 + 3 = 3, with c(A,A) = c(1,1)
S(0,1) + c(A,-) = -5 - 5 = -10, with c(A,-) = c(1,-)
S(1,0) + c(-,A) = -5 - 5 = -10, with c(-,A) = c(-,1)
Similarly, one finds S(2,1) as the maximum of
three values: (-5) - 1 = -6, 3 - 5 = -2 and
(-10) - 5 = -15
→ the best entry is the addition of the C
indel to the A-A match, for a score of -2 (see
next page).

27
Box 2.1 Pairwise Sequence Alignment
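The alignment matrix of sequences 1 and 2
$$\begin{aligned}
S(2,1) &= \max \{\, S(1,0) + c(2,1),\; S(1,1) + c(2,-),\; S(2,0) + c(-,1) \,\} \\
&= \max \{\, S(1,0) + c(C,A),\; S(1,1) + c(C,-),\; S(2,0) + c(-,A) \,\} \\
&= \max \{\, -5 - 1,\; 3 - 5,\; -10 - 5 \,\} = -2
\end{aligned}$$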
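The matrix fill described on these slides can be sketched as follows. This is a minimal global-alignment (Needleman-Wunsch style) fill with the lecture's scores (match 3, mismatch -1, gap -5) and a proportional gap penalty. The function name and the list-of-lists layout are choices made for this illustration.

```python
def nw_matrix(seq1, seq2, match=3, mismatch=-1, gap=-5):
    """Fill the score matrix S, where S[i][j] is the best score for aligning
    the first i residues of seq1 against the first j residues of seq2."""
    m, n = len(seq1), len(seq2)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):          # first column: seq1 prefix vs gaps only
        S[i][0] = i * gap
    for j in range(1, n + 1):          # first row: seq2 prefix vs gaps only
        S[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            c = match if seq1[i - 1] == seq2[j - 1] else mismatch
            S[i][j] = max(
                S[i - 1][j - 1] + c,   # align residue i with residue j
                S[i - 1][j] + gap,     # residue i of seq1 against a gap
                S[i][j - 1] + gap,     # residue j of seq2 against a gap
            )
    return S

S = nw_matrix("ACGCTGA", "ACTGT")
print(S[1][1], S[2][1], S[7][5])
```

For the lecture's two sequences the cells (1,1), (2,1) and (7,5) come out as 3, -2 and 1, matching the values worked out on the slides.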
28
Box 2.1 Pairwise Sequence Alignment
Traceback → determine the actual alignment
Start from the top right-hand corner → the (7,5) cell
For example, the 1 in the (7,5) cell could only be
reached by the addition of the mismatch A-T
ACGCTGA        ACGCTGA
A--CTGT   or   AC--TGT
4 matches, 1 mismatch, 2 indels
The ambiguity has to do with which C in seq. 1
aligns with the C in seq. 2
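The ambiguity can be checked by scoring the two candidate alignments directly. The helper below is a sketch using the lecture's scoring scheme (match 3, mismatch -1, gap -5); its name and interface are illustrative.

```python
def score_alignment(row1, row2, match=3, mismatch=-1, gap=-5):
    """Score two gapped rows of equal length, column by column."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            total += gap
        elif a == b:
            total += match
        else:
            total += mismatch
    return total

print(score_alignment("ACGCTGA", "A--CTGT"))
print(score_alignment("ACGCTGA", "AC--TGT"))
```

Both rows contain 4 matches, 1 mismatch and 2 indels, so they tie at 4(3) + (-1) + 2(-5) = 1, which is why the traceback cannot choose between them.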
29
Box 2.1 Pairwise Sequence Alignment

Parameter settings - Gap penalties
Default settings are the easiest to use, but they
do not necessarily yield the correct alignment
constant penalty → independent of the length of
the gap, A
proportional penalty → penalty is proportional to
the length L of the gap, BL (that is what we used
in this lecture)
affine gap penalty → gap-opening penalty +
gap-extension penalty, A + BL
There is no rule for predicting the penalty that
best suits the alignment
Optimal penalties vary from seq. to seq. → it is
a matter of trial and error
Usually A > B, because of the cost of opening a gap
(usually A/B ≈ 10)
Hints: (1) when comparing distantly related seqs., a
high A and a very low B often give the best results →
gaps are penalized more for their existence than for
their length; (2) when comparing closely related seqs.,
penalize both opening and extension
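The three penalty schemes can be written as small functions. This is a sketch under the slide's definitions (A for opening, B per gap position, L the gap length). Penalties are returned as positive costs, and the values A = 10, B = 1 are hypothetical, chosen only to match the A/B ratio mentioned above. Some programs define the affine cost as A + B(L - 1) instead.

```python
def constant_penalty(length, A):
    """Same cost for any gap, regardless of its length."""
    return A if length > 0 else 0

def proportional_penalty(length, B):
    """Cost grows linearly with gap length: B * L."""
    return B * length

def affine_penalty(length, A, B):
    """Opening cost plus per-position extension cost: A + B * L."""
    return A + B * length if length > 0 else 0

A, B = 10, 1  # hypothetical values with A/B = 10
for L in (1, 3, 10):
    print(L, constant_penalty(L, A), proportional_penalty(L, B), affine_penalty(L, A, B))
```

With a high A and a low B, one gap of length 3 costs 13, while three separate gaps of length 1 cost 33, so the scheme favors a few long gaps over many short ones.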

30
Exercise 2.2 Computing an optimal sequence
alignment

Two score schemes
(1) Gap penalty -5, mismatch -1, match 3
(2) Gap penalty -1, mismatch -1, match 3
(1) First alignment score: 5×3 + 2(-1) = 13
Second/Third alignment score: 6×3 + 2(-5) = 8
(2) First alignment score: 5×3 + 2(-1) = 13
Second/Third alignment score: 6×3 + 2(-1) = 16
A more serious problem: identifying the wrong
alignment
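The effect of the gap penalty on which alignment wins can be reproduced from the counts alone. The counts below (5 matches and 2 mismatches for the first alignment, 6 matches and 2 gaps for the second/third) are inferred from the arithmetic on the slide, not read from the alignments themselves.

```python
def score_from_counts(matches, mismatches, gaps, match=3, mismatch=-1, gap=-5):
    """Total alignment score from the number of each column type."""
    return matches * match + mismatches * mismatch + gaps * gap

candidates = {
    "first": (5, 2, 0),         # 5 matches, 2 mismatches, no gaps
    "second/third": (6, 0, 2),  # 6 matches, no mismatches, 2 gaps
}

best = {}
for gap_penalty in (-5, -1):
    scores = {name: score_from_counts(*counts, gap=gap_penalty)
              for name, counts in candidates.items()}
    best[gap_penalty] = max(scores, key=scores.get)
    print(gap_penalty, scores, "->", best[gap_penalty])
```

With a gap penalty of -5 the ungapped alignment wins (13 vs 8). With -1 the gapped alignments win (16 vs 13), so the parameter choice alone changes the reported optimum.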

31
Exercise 2.2 Computing an optimal sequence
alignment
Gap penalty -5
Gap penalty -1
32
Emerging Sequencing Methods

Costs of genome sequencing
Mid-2000: $30-50 million to sequence a
mammalian genome
Target: $1,000 per human genome by the year 2010
J. Craig Venter Foundation: $500,000 award for
the first to achieve this goal
New technologies
Sequencing by hybridization (SBH): detects
whether or not an exact match is present in a
sample of DNA
Mass spectrometric techniques: ionized
fragments, time of flight
Nanopore sequencing strategies: ultrafast and
relatively inexpensive sequencing of long DNA
fragments
Single-molecule approaches: Solexa, Visigen and
Genovoxx
Single-molecule polony sequencing

33
Figure 2.6 Single-molecule polony sequencing
Emerging Sequencing Methods
A dilute solution of DNA is plated onto a glass
microscope slide. In situ PCR produces thousands
of tiny colonies of DNA (polonies, i.e. PCR
colonies), which then incorporate single
dye-labeled dNTPs. The slide is read after each cycle
of incorporation of a new base, allowing short
seqs. to be determined. Each numbered polony
produces a short 20-25 nucleotide seq., as
shown. These can then be assembled
computationally into a contiguous seq.
34
Figure 2.7 (Part 1) Hierarchical versus shotgun
sequencing
Genome Sequencing

Whole genome seqs. are assembled from
hundreds of thousands of fragments, each typically
between 500 and 1000 bp in length.
Two general approaches for fragmentation
and assembly: (1) hierarchical seq., (2) shotgun
seq.
For a historical overview, see
http://www.sciencemag.org/feature/plus/sfg/human/timeline1.shtml
Hierarchical seq.
First develop a low-resolution physical
map, so that the seq. is obtained in
large ordered pieces.
Shotgun seq.
Break the genome into small fragments and use
computer algorithms to assemble them, see Figure
2.7
Most new genome projects adopt the shotgun
approach.

35
Genome Sequencing hierarchical sequencing

Top-down, map-based or clone-by-clone strategy,
late 1980s
Genome → break into small fragments
The relative locations of the fragments are known
BEFORE sequencing
Advantages
It fostered (helped develop) the assembly of
high-resolution physical and genetic maps
It allowed groups around the globe to divide up the work
Technology for cloning large fragments of genomes
progressed rapidly throughout the 1990s, e.g., for
E. coli, S. cerevisiae, C. elegans and A.
thaliana.
Top-down seq. → clone seqs. as manageable
fragments (50-200 kb in length)
Cloning vectors: BAC (300 kb), PAC (100 kb),
phage-derived cosmids

36
Figure 2.7 (Part 2) Shotgun sequencing
Genome Sequencing Shotgun sequencing
In the shotgun approach, no attempt is made to
order the clones in advance. Instead, the whole
genome is assembled using computer algorithms
that order contigs based on their overlapping
sequences.
37
Figure 2.8 Cloning vectors used in genome
sequencing
38
Genome Sequencing hierarchical sequencing

DNA libraries
Fragmented by restriction enzyme (RE) or sonication
Fragments are ligated into a multiple cloning
site (mcs) in the vector
Aim for 5- to 10-fold redundancy → the genome library
holds 5 to 10 times the genome length
Each clone will have different ends → possible to
select a scaffold of clones that forms a
contiguous seq. coverage, a tiling path,
by aligning the regions of overlap (Fig. 2.9)
The tiling path can be assembled using a
combination of 3 methods: (1) hybridization, (2)
fingerprinting, and (3) end-sequencing
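One way to see why 5- to 10-fold redundancy is the usual target is the standard Poisson approximation, which is not stated on the slide and is added here as an assumption: if clones start at random positions, the expected fraction of the genome not covered by any clone at redundancy c is about e^(-c). The library sizes below are hypothetical.

```python
import math

def redundancy(n_clones, insert_len, genome_len):
    """Fold coverage: total cloned bases divided by genome length."""
    return n_clones * insert_len / genome_len

def fraction_uncovered(c):
    """Poisson estimate of the genome fraction missed at redundancy c."""
    return math.exp(-c)

# Hypothetical library: 5,000 BACs of 100 kb over a 100 Mb genome.
c = redundancy(5_000, 100_000, 100_000_000)
print(c, fraction_uncovered(c))
print(10, fraction_uncovered(10))
```

Under this model 5-fold redundancy leaves well under 1% of the genome unsampled, and 10-fold leaves a few thousandths of a percent. Real libraries do worse because cloning is not perfectly random.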

39
Figure 2.9 Hierarchical assembly of a
sequence-contig scaffold (supercontig)
Genome Sequencing hierarchical sequencing

A minimal tiling path through a library of
aligned BAC clones that ensures complete coverage
of the chromosome is chosen.
After sequencing independent shotgun libraries
for each BAC, small gaps in the sequenced clone
contigs remain.
These are closed as far as possible by merging
the two BAC sequences, as well as by the addition
of mate-pair information (yellow) and cDNA
structural information (red), which establishes
the orientation and distance between cloned
segments.

40
Genome Sequencing hierarchical sequencing

Hybridization
All of the clones in a library that carry a
particular seq. can be identified rapidly by
hybridizing a small radioactively or chemically
labeled probe containing the seq. to a filter on
which an array of tens of thousands of clones is
printed (Fig. 2.10A)
Fingerprinting
Study the restriction enzyme (RE) patterns
One way to assemble contigs of large-insert clones
is to compare and
align them according to their RE patterns
RE with a 6-bp site → 4^6 = 2^12 ≈ 4000 bp between sites
For a 100-kb BAC → 100 kb / 4 kb ≈ 20-30
fragments
These fragments can be separated by
electrophoresis → fingerprint profile → BAC
alignment by gel → software alignment →
overlaps → contigs → assembly of Mb-length
contigs
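The fingerprinting arithmetic can be checked in a few lines. The sketch assumes all four bases are equally frequent and independent, so a specific k-bp recognition site occurs on average once every 4^k bp. The function names are illustrative.

```python
def expected_site_spacing(site_len):
    """Average distance between occurrences of a specific site of site_len bp."""
    return 4 ** site_len

def expected_fragments(insert_len, site_len):
    """Rough number of restriction fragments from an insert of insert_len bp."""
    return insert_len / expected_site_spacing(site_len)

print(expected_site_spacing(6))
print(round(expected_fragments(100_000, 6), 1))
```

A 6-bp cutter therefore cuts about every 4 kb, and a 100-kb BAC yields on the order of 25 fragments, consistent with the 20-30 quoted above.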

41
Figure 2.10 Aligning BAC clones by hybridization
and fingerprinting
Genome Sequencing hierarchical sequencing

(A) A macroarray of BAC clones is probed
with a short, radioactive fragment to
identify all BACs that carry a specific
fragment.
These clones are digested with an RE, end-labeled,
and separated by gel electrophoresis.
Software converts the bands to a virtual
profile, shown hypothetically for a small
portion of four bands (highlighted box in
part B). Shared bands (red or blue) imply
that the two clones share the same seq.
Green indicates the vector band common to
all clones.
The fingerprint profile is then converted into
a BAC alignment. In this example, clone 2
does not share any bands with the others and
so is placed into a separate BAC contig, while the
other three clones form a tiling path.

42
Genome Sequencing hierarchical sequencing

End-sequencing
Fill in the gaps after fingerprinting. How? By
sequencing both ends of the collection of BAC
clones
Once a critical threshold of seqs. has been
reached → overlaps
For example, along a 10 Mbp genome, end seqs. of
10,000 BAC clones → provide a seq. tag every 5 kb
(for a 5-fold coverage)
Along a 10 Mbp genome → 10 Mbp / 10,000 BACs → 1
kbp/BAC
Five-fold → 10 Mb / 2,000 BACs = 5 kb (the seq. tag
spacing)
Given this tag density, it is possible to close
gaps < 50 kb
Once the tiling path is chosen → shotgun the BAC
clones into small fragments
Subcloning: use an M13 phagemid (1 kb; exists as
dsDNA and ssDNA)
or clone 2-3 kb fragments into a plasmid vector
43
Genome Sequencing Shotgun sequencing

Use computer algorithms to assemble the seqs.
(100,000)
About 5- to 10-fold redundancy for each fragment
Library: from a single whole genome
After MSA → screen out repetitive seqs., overlap
reads of the same seq. → generate
unitigs and scaffolds → >90% of the seqs. are
assembled
Finishing phase: closing gaps, cleaning up
ambiguities → takes as much time as
the shotgun phase
Users are asked to trust the assemblies
Celera Genomics used the following software to
assemble the seqs.
Screener: masks (does not remove) seqs. that
contain repetitive DNA
(such as microsatellites, LINEs, Alu repeats,
retrotransposons and ribosomal DNA)
Overlapper: compares every unscreened read
against every other unscreened read,
searching for overlaps of a predetermined length
and identity.
Parallel processing on 40 supercomputers, each
with 4 GB RAM, allowed the 27 M
screened human seq. reads to be overlapped in <
5 days!
Unitigger: resolves repeat-induced overlaps of a
seq. (see Figure 2.11).
Scaffolder: uses mate-pair information to link
U-unitigs into scaffold contigs

44
Genome Sequencing Shotgun sequencing

Figure 2.11
(A) Seq. alignment between two or more shotgun clones
can arise between unique seqs. (left) or
repetitive seqs. (right).
(B) The Overlapper aligns unitigs, which are
identified as unique seq. alignments (U-unitigs)
or overcollapsed repeats (blue).
Two contigs can be aligned and
oriented by using mate-pair seq.
information from the ends of longer (10- or
50-kb) clones, as shown at the bottom, while
mate-pairs from 2-kb fragments allow assembly of
scaffolds despite the presence of simple repeats
such as microsatellites (blue) that are masked
before performing alignments.

Figure 2.11 U-unitigs and repeat resolution
45
Genome Sequencing Shotgun sequencing
Figure 2.12 shows the estimated coverage of the
fly and human whole genomes after initial
assembly: in both cases, 84% or more of the
genome was covered by scaffolds at least 100 kb
in length, while most scaffolds were in the Mb
range. Increasing seq. coverage from 5x to 10x → a
10% increase in the proportion of scaffolds of
lengths up to 1 Mb. The plot shows the percentage of
scaffolds that have a length greater than that indicated
for the fly 10x, human 8x (CSA) and human 5x
(whole genome assembly, WGA) seqs. generated by
Celera. The fly and CSA assemblies include
shredded seqs. generated from BAC clones
by public genome sequencing efforts.
Figure 2.12 Proportion of fly and human genomes
in large scaffolds
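A statistic of the kind quoted for Figure 2.12 (the share of the assembled sequence held in scaffolds of at least a given length) can be computed as sketched below. The scaffold lengths are hypothetical, not Celera's data, and the function name is illustrative.

```python
def fraction_in_large_scaffolds(lengths, min_len):
    """Fraction of all assembled bases that sit in scaffolds >= min_len."""
    total = sum(lengths)
    large = sum(length for length in lengths if length >= min_len)
    return large / total

# Hypothetical scaffold lengths in bp.
scaffolds = [2_000_000, 1_500_000, 400_000, 60_000, 40_000]
for threshold in (100_000, 1_000_000):
    print(threshold, fraction_in_large_scaffolds(scaffolds, threshold))
```

Evaluating the function over a range of thresholds gives a curve of the same general shape as the plot described above.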
46