Title: Gibson Second Edition
Genome Sequencing and Annotation (Part 1)
- Objective of most genome projects
- Sequencing DNA, mRNA
- Identify genes and characterize gene features
- This chapter
- How blocks of DNA sequence are obtained
- How these blocks are assembled into contigs, then genomes
- Bioinformatics: how to do sequence alignment, such as cDNA/EST and genome sequences
- Annotation of ORFs
- Other features of genes: repetitive elements, variable distribution of GC content, evolutionarily conserved elements
- Gene annotation by cross-species annotation
Figure 2.1 (Part 2) The principle of dideoxy (Sanger) sequencing
Automated DNA sequencing
In 1977, F. Sanger developed the chain-termination method (Sanger sequencing). Sanger won his second Nobel Prize for inventing this process.
Automated DNA sequencing
- Most current sequencing projects use the chain-termination method
- Also known as Sanger sequencing, after its inventor
- Based on the action of DNA polymerase
- Adds nucleotides to the complementary strand
- Requires template DNA and a primer
Chain-termination sequencing
- Dideoxynucleotides (ddA, ddT, ddC or ddG) stop synthesis
- Chain terminators (DNA polymerase cannot add another nucleotide)
- Included in limiting amounts, so that some chains terminate at every position where the base is incorporated
- Use four reactions
- One for each base: A, C, G and T
Template
3' ATCGGTGCATAGCTTGT 5'
5' TAGCCACGTATCGAACA 3'
5' TAGCCACGTATCGAA 3'
5' TAGCCACGTATCGA 3'
5' TAGCCACGTA 3'
5' TAGCCA 3'
5' TA 3'
Sequence reaction products
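The products listed on this slide can be reproduced with a short sketch. It assumes an idealized reaction in which a chain terminates at every position where the chosen ddNTP could be incorporated; the function names are illustrative and not from the slides.

```python
def complement_strand(template_3to5: str) -> str:
    """Return the newly synthesized strand (5'->3') for a template written 3'->5'."""
    pairs = {"A": "T", "T": "A", "C": "G", "G": "C"}
    return "".join(pairs[base] for base in template_3to5)


def terminated_products(template_3to5: str, dd_base: str) -> list[str]:
    """List every chain that ends at a position where dd_base was incorporated."""
    new_strand = complement_strand(template_3to5)
    return [new_strand[: i + 1] for i, base in enumerate(new_strand) if base == dd_base]


template = "ATCGGTGCATAGCTTGT"  # written 3'->5', as on the slide
for product in terminated_products(template, "A"):  # the ddA reaction
    print("5'", product, "3'")
```

Running the ddA reaction gives the six chains shown above, from TA up to the full-length TAGCCACGTATCGAACA.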
Sequence detection
- To detect products of the sequencing reaction
- Include labeled nucleotides
- Formerly, radioactive labels (33P or 35S) were used
- Now fluorescent labels
- Use a different fluorescent tag for each nucleotide
- Can run all four reactions in a single gel lane or capillary tube
TAGCCACGTATCGAA
TAGCCACGTATC
TAGCCACG
TAGCCACGT
Sequence separation
- Terminated chains need to be separated
- Requires one-base-pair resolution
- See the difference between chains of X and X+1 base pairs
- Gel electrophoresis
- Very thin gel
- High voltage applied
- Works with radioactive or fluorescent labels
- Negative pole at the top
C A G T C A G T
Sequence reading of radioactively labeled reactions
- The final step of sequencing is to read the sequence
- Radioactively labeled reactions
- Gel dried
- Placed on X-ray film
- Film developed; the position of each band becomes visible
- Sequence read from the bottom up (starting at the positive pole)
- Each of the four lanes gives the positions of a different base: A, T, C or G
Sequence reading of fluorescently labeled reactions
- Fluorescently labeled reactions are scanned by a laser as each fragment passes a particular point
- Color picked up by detector
- Output sent directly to computer
- The readout is given both in terms of bases and the intensity of each color, so that ambiguous readings are easily identified
Summary of chain termination sequencing
A primer is extended by DNA polymerase based on the sequence present in the template strand. Each chain is terminated by a ddNTP that is complementary to the template base at that position. The four reactions are separated on a gel that can resolve one-base differences. The sequence is then read from the bottom of the gel to the top.
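Reading a gel from the bottom up amounts to sorting the bands by fragment length and noting which lane each band is in. The sketch below assumes an idealized gel with exactly one band per chain length; `read_gel` and the lane data are illustrative, derived from the example template used earlier in these slides.

```python
def read_gel(lanes: dict[str, list[int]]) -> str:
    """Rebuild the synthesized strand from the band positions in four lanes.

    lanes maps the ddNTP used in a reaction (A, C, G or T) to the lengths of
    the terminated chains seen in that lane.
    """
    bands = [(length, base) for base, lengths in lanes.items() for length in lengths]
    # The shortest chains migrate furthest, so reading bottom-up means
    # reading the bands in order of increasing length.
    bands.sort()
    return "".join(base for _, base in bands)


# Band lengths for the strand 5' TAGCCACGTATCGAACA 3' (hypothetical gel).
lanes = {
    "A": [2, 6, 10, 14, 15, 17],
    "C": [4, 5, 7, 12, 16],
    "G": [3, 8, 13],
    "T": [1, 9, 11],
}
print(read_gel(lanes))
```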
High-Throughput Sequencing
- The new techniques and equipment include
- (1) Four-color fluorescent dyes have replaced the radioactive label
- (2) Rather than stopping the electrophoresis at a particular time, the products are scanned for laser-induced fluorescence just before they run off the end of the electrophoresis medium
- (3) Improvements in the chemistry of template purification and the sequencing reaction
- (4) Slab gel electrophoresis gave way to capillary electrophoresis with the introduction in 1999 of Applied Biosystems ABI Prism 3700 automated sequencers, which in turn were updated with ABI Prism 3730 DNA analyzers in 2003 (these deliver extremely high-quality, long reads and save time and money)
ABI Prism 3730 DNA analyzers
Reading sequence traces
- Base-calling: the reading of raw sequence traces
- Now routinely performed using automated software that reads bases, aligns similar sequences and supports editing
- Program phred: http://www.phrap.org
- The program assigns probability scores to the accuracy of each base call as the trace is read
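The slides do not state how these probability scores are expressed. phred reports them on a logarithmic scale, Q = -10 log10(P), where P is the estimated probability that the base call is wrong. A minimal sketch of the conversion follows; the function names are illustrative.

```python
import math


def phred_quality(error_probability: float) -> float:
    """Convert an estimated error probability into a phred-scaled quality value."""
    return -10.0 * math.log10(error_probability)


def error_probability(quality: float) -> float:
    """Convert a phred-scaled quality value back into an error probability."""
    return 10.0 ** (-quality / 10.0)


for q in (10, 20, 30, 40):
    print(f"Q{q}: about 1 wrong call in {round(1 / error_probability(q))}")
```

On this scale Q20 corresponds to a 1% chance of error and Q30 to 0.1%.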
Figure 2.3 Automated sequence chromatograms
- This sequence shows the noisiness of the first 30 bp of a run.
- The middle two rows show a segment of two sequences that are polymorphic for both SNPs and an indel.
- A decline in sequence quality typically occurs after about 800 bp.
- Ex. 2.1 Reading a sequence trace
A base is labeled N where the sequence quality is poor. Where two peaks of the same height are observed at the same location, the site is heterozygous for a C/T SNP.
Figure 2.5 An aligned-reads window in consed
Contig Assembly
Assembling DNA seq. fragments
- NCBI dbEST database: http://www.ncbi.nlm.nih.gov/Database/
- View the EST statistics
- FTP EST files
Assembling DNA seq. fragments
- IFOM assembler
- http://bio.ifom-firc.it/ASSEMBLY/assemble.html
- Multiple EST sequences → contig
- The maximum number of sequences you can enter is 10,000
- Use GI numbers 15744427, 19124086, 8147732, 8147734, 20393914, 13728017
- Lengths: 850, 1062, 634, 596, 869, 768 bp
- The result is a single contig consensus sequence, which can be used for a similarity search against a database
Assembling DNA seq. fragments: 6 GI fragments
>gi|15744427|gb|BI752849.1|BI752849 603022060F1 NIH_MGC_114 Homo sapiens cDNA clone IMAGE:5192510 5', mRNA sequence
CGGGGTGCTGCGAGCGCGGGGCCAGACCAAGGC
GGGCCCGGAGCGGAACTTCGGTCCCAGCTCGGTCCCCGGCTCAGTCCCGA
CGTGGAACTCAGCAGCGGAGGCTGGACGCTTGCATGGCGCTTGAGAGATT
CCATCGTGCCTGGCTCACATAAGCGCTTCCTGGAAGTGAAGTCGTGCTGT
CCTGAACGCGGGCCAGGCAGCTGCGGCCTGGGGGTTTTGGAGTGATCACG
AATGAGCAAGGCGTTTGGGCTCCTGAGGCAAATCTGTCAGTCCATCCTGG
CTGAGTCCTCGCAGTCCCCGGCAGATCTTGAAGAAAAGAAGGAAGAAGAC
AGCAACATGAAGAGAGAGCAGCCCAGAGAGCGTCCCAGGGCCTGGGACTA
CCCTCATGGCCTGGTTGGTTTACACAACATTGGACAGACCTGCTGCCTTA
ACTCCTTGATTCAGGTGTTCGTAATGAATGTGGACTTCACCAGGATATTG
AAGAGGATCACGGTGCCCAGGGGAGCTGACGAGCAGAGGAGAAGCGTCCC
TTTCCAGATGCTTCTGCTGCTGGAGAAGATGCAGGACAGCCGGCAGAAAG
CAGTGCGGCCCCTGGAGCTGGCTACTGCCTGCAGAAGTGCAACGTGCCCT
TGTTTGTCCAACATGATGCTGCCAACTGTACCTCAAACTCTGGAACCTGA
TTAAGGACCAGATCACTGATGTGCACTTGGTGGAGAGACTGCAGGCCCTG
TATATGATCCGGGTGAAGGACTCCTTGATATGCGTTGACTGTGCCATGGG
AGAGTAGCAGAAAACAGCAGCATGCTCAACCTCCCACTTTCTCTATTGGA
TGTGGACTCAAAGCCCT
>gi|19124086|gb|BM807263.1|BM807263 AGENCOURT_6574903 NIH_MGC_124 Homo sapiens cDNA clone IMAGE:5732238 5', mRNA sequence
GTCCGGAATTCCC
GGGATCTCAGCAGCGGAGGCTGGACGCTTGCATGGCGCTTGAGAGATTCC
ATCGTGCCTGGCTCACATAAGCGCTTCCTGGAAGTGAAGTCGTGCTGTCC
TGAACGCGGGCCAGGCAGCTGCGGCCTGGGGGTTTTGGAGTGATCACGAA
TGAGCAAGGCGTTTGGGCTCCTGAGGCAAATCTGTCAGTCCATCCTGGCT
GAGTCCTCGCAGTCCCCGGCAGATCTTGAAGAAAAGAAGGAAGAAGACAG
CAACATGAAGAGAGAGCAGCCCAGAGAGCGTCCCAGGGCCTGGGACTACC
CTCATGGCCTGGTTGGTTTACACAACATTGGACAGACCTGCTGCCTTAAC
TCCTTGATTCAGGTGTTCGTAATGAATGTGGACTTCACCAGGATATTGAA
GAGGATCACGGTGCCCAGGGGAGCTGACGAGCAGAGGAGAAGCGTCCCTT
TCCAGATGCTTCTGCTGCTGGAGAAGATGCAGGACAGCCGGCAGAAAGCA
GTGCGGCCCCTGGAGCTGGCCTACTGCCTGCAGAAGTGCAACGTGCCCTT
GTTTGTCCAACATGATGCTGCCCAACTGTACCTCAAACTCTGGAACCTGA
TTAAGGACCAGATCACTGATGTGCACTTGGTGGAGAGACTGCAGGCCCTG
TATACGATCCGGGTGAAGGACTCCTTGATTTGCGTTGACTGTGCCATGGA
GAGTAGCAGAAACAGCAGCATGCTCACCCTCCCACTTTCTCTTTTTGATG
TGGACTCAAAGCCCCTGGAAGACACTGGAGGACGCCCTGCACTGCTTCTT
CCAGCCCAGGAGTTATCAAGCAAAAGCAAGTGCTTCTGTGAGAACTGTGG
GAAGAAGACCCGCGGGGAACAGGGTCCTGAAACCTGACCATTTTGCCCCA
GACCTTGACCAATCCACCTCATGGCGATTCTCCCTCCAGGAATTCCCCGA
CCGAGAAAAAATTGGCCACTTCCCCGGAATTTCCCCCCAAAAACTTGGAA
TTTCACCCAAAACCTTTCCCATGTAAACCCGGAAACCCTGGGGAAGGCT
>gi|8147732|gb|AW958049.1|AW958049 EST370119 MAGE resequences, MAGE Homo sapiens cDNA, mRNA sequence
GAACTAGTGGATCCCCCGGGCTGCAGGAATTCGGCACGAGTG
GAGCTGGCCTACTGCCTGCAGAAGTGCAACGTGCCCTTGTTTGTCCAACA
TGATGCTGCCCAACTGTACCTCAAACTCTGGAACCTGATTAAGGACCAGA
TCACTGATGTGCACTTGGTGGAGAGACTGCAGGCCCTGTATATGATCCGG
GTGAAGGACTCCTTGATTTGCGTTGACTGTGCCATGGAGAGTAGCAGAAA
CAGCAGCATGCTCACCCTCCCACTTTCTCTTTTTGATGTGGACTCAAAGC
CCCTGAAGACACTGGAGGACGCCCTGCACTGCTTCTTCCAGCCCAGGGAG
TTATCAAGCAAAAGCAAGTGCTTCTGTGAGAACTGTGGGAAGAAGACCCG
TGGGAAACAGGTCTTGAAGCTGACCCATTTGCCCCAGACCCTGACAATCC
ACCTCATGCGATTCTTCATCAGGAATTCACAGACGAGAAAGATCTGCCAC
TCCCTGTACTTCCCCCAGAGCTTGGATTTCAGCCAGAACCTTCCAATGAA
GCGAGAATCTTGTGAAGCTGAAGAACAGTCTGGAAGGCAAGATGAGCTTT
TTGCTGGGAATGCGCACGTGGAAAGGCAGAATTCGGTCATAA
>gi|8147734|gb|AW958051.1|AW958051 EST370121 MAGE resequences, MAGE Homo sapiens cDNA, mRNA sequence
GGAGCTGGCCTACTGCCTGCAGAAGTGCAACGTGCCCTTGTT
TGTCCAACATGATGCTGCCCAACTGTACCTCAAACTCTGGAACCTGATTA
AGGACCAGATCACTGATGTGCACTTGGTGGAGAGACTGCAGGCCCTGTAT
ATGATCCGGGTGAAGGACTCCTTGATTTGCGTTGACTGTGCCATGGAGAG
TAGCAGAAACAGCAGCATGCTCACCCTCCCACTTTCTCTTTTTGATGTGG
ACTCAAAGCCCCTGAAGACACTGGAGGACGCCCTGCACTGCTTCTTCCAG
CCCAGGGAGTTATCAAGCAAAAGCAAGTGCTTCTGTGAGAACTGTGGGAA
GAAGACCCGTGGGAAACAGGTCTTGAAGCTGACCCATTTGCCCCAGACCC
TGACAATCCACCTTATGCGATTCTCCATCAGGAATTCACAGACGAGAAAG
ATCTGCCACTCCCTGTACTTCCCCCAGAGCTTGGATTTCAGCCAGATCCT
TCCAATGAAGCGAGAGTCTTGTGATGCTTGAGGAGCAATCTGGAGGGCAT
ATGAGCTTTTTGCTGTGATTGCGCACCTGGGAATGCAAAACTCCGTCATT
ACTG
>gi|20393914|gb|BQ213074.1|BQ213074 AGENCOURT_7559959 NIH_MGC_72 Homo sapiens cDNA clone IMAGE:6055692 5', mRNA sequence
AGATCTGCCACTC
CCTGTACTTCCCCCAGAGCTTGGATTTCAGCCAGATCCTTCCAATGAAGC
GAGAGTCTTGTGATGCTGAGGAGCAGTCTGGAGGGCAGTATGAGCTTTTT
GCTGTGATTGCGCACGTGGGAATGGCAGACTCCGGTCATTACTGTGTCTA
CATCCGGAATGCTGTGGATGGAAAATGGTTCTGCTTCAATGACTCCAATA
TTTGCTTGGTGTCCTGGGAAGACATCCAGTGTACCTACGGAAATCCTAAC
TACCACTGGCAGGAAACTGCATATCTTCTGGTTTACATGAAGATGGAGTG
CTAATGGAAATGCCCAAAACCTTCAGAGATTGACACGCTGTCATTTTCCA
TTTCCGTTCCTGGATCTACGGAGTCTTCTAAGAGATTTTGCAATGAGGAG
AAGCATTGTTTTCAAACTATATAACTGAGCCTTATTTATAATTAGGGATA
TTATCAAAATATGTAACCATGAGGCCCCTCAGGTCCTGATCAGTCAGAAT
GGATGCTTTCACCAGCAGACCCGGCCATGTGGCTGCTCGGTCCTGGGTGC
TCGCTGCTGTGCAAGACATTAGCCCTTTAGTTATGAGCCTGTGGGAACTT
CAGGGGTTCCCAGTGGGGAGAGCAGTGGCAGTGGGAGGCATCTGGGGGGC
CAAGGGCAGTGGCAGGGGGTATTTCAGTATTATACCACTGCTGTGACCAG
ACTTGTATACTGGCTGAATATCAGGGCTGGTTGTAATTTTTTCCCTTTGA
AGAAACACCATTAATTTCCTAATGAATCCAAGTGGTTTGTAACTTGCCTA
TTCCTTTTATTCCAGCAAAAAATTAATTGATCATCCCCTCCCCCAAAAAA
TAGGGG
>gi|13728017|gb|BG206330.1|BG206330 RST25778 Athersys RAGE Library Homo sapiens cDNA, mRNA sequence
TCCTGGGAAGACATCCAGTGTACCTACGGAAATCCTAACTAC
CACTGGCAGGAAACTGCATATCTTCTGGTTTACATGAAGATGGAGTGCTA
ATGGAAATGCCCAAAACCTTCAGAGATTGACACGCTGTCATTTTCCATTT
CCGTTCCTGGATCTACGGAGTCTTCTAAGAGATTTTGCAATGAGGAGAAG
CATTGTTTTCAAACTATATAACTGAGCCTTATTTATAATTAGGGATATTA
TCAAAATATGTAACCATGAGGCCCCTCAGGTCCTGATCAGTCAGAATGGA
TGCTTTCACCAGCAGACCCGGCCATGTGGCTGCTCGGTCCTGGGTGCTCG
CTGCTGTGCAAGACATTAGCCCTTTAGTTATGAGCCTGTGGGAACTTCAG
GGGTTCCCAGTGGGGAGAGCAGTGGCAGTGGGAGGCATCTGGGGGCCAAA
GGTCAGTGGCAGGGGGTATTTCAGTATTATACAACTGCTGTGACCAGACT
TGTATACTGGCTGAATATCAGTGCTGTTTGTAATTTTTCACTTTGAGAAC
CAACATTAATTCCATATGAATCAAGTGTTTTGGAACTGCTATTCATTTAT
TCAGCAAATATTTATTGGTCATCTTTTCTCCATAAGATAGTGTGATAAAC
ACAGCATGAATAAAGGTATTTTCCACACAGACAAGTGTTTTTTCACAAAA
TTATTNATTTTGNTGGGGCTGTGGCGGCCGCTTCCTTTATGGGGGGGAAT
TTAGAACCCGTTCCTGACGCGGGGGN
Assembling DNA seq. fragments
Assembled mRNA sequence
Box 2.1 Pairwise Sequence Alignment
- The most important class of bioinformatics tools: pairwise alignment of DNA and protein sequences
-          alignment 1    alignment 2
- Seq. 1   ACGCTGA        ACGCTGA
- Seq. 2   A--CTGT        ACTGT--
- Seeks alignments with high sequence identity and few mismatches and gaps
- Assumption: the observed identity in the sequences to be aligned is the result of either random chance or a shared evolutionary origin
- Identity ≠ similarity
- Sequence identity = homology (a risky assumption)
- Sequence identity ≠ homology
Box 2.1 Pairwise Sequence Alignment
The same true alignment can arise through different evolutionary events. Scoring scheme: substitution = -1, indel = -5, match = 3
(Figure A panel labels: indel; alignment scores 9, 5, 4, 4)
Figure A Common evolutionary events and their effects on alignment
Box 2.1 Pairwise Sequence Alignment
- Find the optimal score → the best guess for the true alignment
- Find the optimal pairwise alignment of two sequences → insert gaps into one or both of them → maximize the total alignment score
- Dynamic programming (DP): Needleman and Wunsch (1970), Smith and Waterman (1981); this algorithm guarantees that we find all optimal alignments of two sequences of lengths m and n
- BLAST builds on DP but uses heuristics to gain speed, so it does not carry the same optimality guarantee
- Prof. Waterman: http://www.usc.edu/dept/LAS/biosci/faculty/waterman.html
Box 2.1 Pairwise Sequence Alignment
The score S(i,j) for the alignment of the first i residues of sequence 1 against the first j residues of sequence 2 is given by
S(i,j) = max{ S(i-1,j-1) + c(i,j), S(i-1,j) + c(i,-), S(i,j-1) + c(-,j) }
where c(i,j) is the score for the alignment of residues i and j, which takes the value 3 for a match or -1 for a mismatch, and c(i,-) = c(-,j) is the penalty for aligning a residue with a gap, which takes the value -5.
Box 2.1 Pairwise Sequence Alignment
- The entry for S(1,1) is the maximum of the following three events
- S(0,0) + c(A,A) = 0 + 3 = 3, where c(A,A) = c(1,1)
- S(0,1) + c(A,-) = -5 + (-5) = -10, where c(A,-) = c(1,-)
- S(1,0) + c(-,A) = -5 + (-5) = -10, where c(-,A) = c(-,1)
- Similarly, one finds S(2,1) as the maximum of three values: (-5) - 1 = -6, 3 - 5 = -2, and (-10) - 5 = -15 → the best entry is the addition of the C indel to the A-A match, for a score of -2 (see next page).
Box 2.1 Pairwise Sequence Alignment
The alignment matrix of sequences 1 and 2
S(2,1) = max{ S(1,0) + c(2,1), S(1,1) + c(2,-), S(2,0) + c(-,1) }
       = max{ S(1,0) + c(C,A), S(1,1) + c(C,-), S(2,0) + c(-,A) }
       = max{ -5 - 1, 3 - 5, -10 - 5 } = -2
Box 2.1 Pairwise Sequence Alignment
Traceback → determine the actual alignment, starting from the top right-hand corner → the (7,5) cell
For example, the 1 in the (7,5) cell could only be reached by the addition of the mismatch A-T
ACGCTGA        ACGCTGA
A--CTGT   or   AC--TGT
4 matches, 1 mismatch, 2 indels
The ambiguity has to do with which C in seq. 1 aligns with the C in seq. 2
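The matrix fill and traceback described in Box 2.1 can be sketched as follows, using the lecture's scores (match 3, mismatch -1, gap -5). This is a minimal global-alignment sketch and not the code of any particular tool. It assumes the border cells are filled with cumulative gap penalties, which matches S(1,0) = -5 and S(2,0) = -10 in the worked example, and it returns only one of the tied optimal alignments by preferring the diagonal move.

```python
def nw_matrix(seq1, seq2, match=3, mismatch=-1, gap=-5):
    """Fill the score matrix S, where S[i][j] aligns seq1[:i] with seq2[:j]."""
    rows, cols = len(seq1) + 1, len(seq2) + 1
    S = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        S[i][0] = i * gap  # residues of seq1 aligned against gaps only
    for j in range(1, cols):
        S[0][j] = j * gap  # residues of seq2 aligned against gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            c = match if seq1[i - 1] == seq2[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + c, S[i - 1][j] + gap, S[i][j - 1] + gap)
    return S


def traceback(S, seq1, seq2, match=3, mismatch=-1, gap=-5):
    """Walk back from the final cell to recover one optimal alignment."""
    i, j = len(seq1), len(seq2)
    top, bottom = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            c = match if seq1[i - 1] == seq2[j - 1] else mismatch
            if S[i][j] == S[i - 1][j - 1] + c:  # match or mismatch
                top.append(seq1[i - 1])
                bottom.append(seq2[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and S[i][j] == S[i - 1][j] + gap:  # gap in seq2
            top.append(seq1[i - 1])
            bottom.append("-")
            i -= 1
        else:  # gap in seq1
            top.append("-")
            bottom.append(seq2[j - 1])
            j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom))


S = nw_matrix("ACGCTGA", "ACTGT")
print(S[1][1], S[2][1], S[7][5])
print(*traceback(S, "ACGCTGA", "ACTGT"), sep="\n")
```

The other alignment on the slide, AC--TGT, corresponds to a tie during traceback at the cell where the second C of seq. 1 meets the C of seq. 2. Recovering every optimal alignment would require following all tied moves instead of the first one.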
Box 2.1 Pairwise Sequence Alignment
- Parameter settings
- Gap penalties
- Default settings are the easiest to use, but they do not necessarily yield the correct alignment
- Constant penalty → independent of the length of the gap: A
- Proportional penalty → penalty is proportional to the length L of the gap: BL (this is what we used in this lecture)
- Affine gap penalty → gap-opening penalty + gap-extension penalty: A + BL
- There is no rule for predicting the penalty that best suits the alignment
- Optimal penalties vary from sequence to sequence → it is a matter of trial and error
- Usually A > B, because opening a gap is penalized more heavily than extending one (usually A/B ≈ 10)
- Hint: (1) when comparing distantly related sequences, a high A and a very low B often give the best results → gaps are penalized more for their existence than for their length; (2) when comparing closely related sequences, penalize both opening and extension
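The three gap-penalty schemes can be written as small functions of the gap length L. This is a sketch only: A and B are given as positive magnitudes to be subtracted from the alignment score, and the values in the example are arbitrary.

```python
def constant_penalty(length: int, A: float) -> float:
    """Same cost A for any gap, regardless of its length."""
    return A if length > 0 else 0


def proportional_penalty(length: int, B: float) -> float:
    """Cost grows linearly with gap length: B * L."""
    return B * length


def affine_penalty(length: int, A: float, B: float) -> float:
    """Opening cost A plus extension cost B per gapped position: A + B * L."""
    return A + B * length if length > 0 else 0


# Compare one 3-base gap with three separate 1-base gaps, using A/B = 10.
A, B = 10, 1
print("one gap of length 3:", affine_penalty(3, A, B))
print("three gaps of length 1:", 3 * affine_penalty(1, A, B))
```

With a high A and a low B, one long gap is much cheaper than several short ones, which is the behaviour hint (1) describes for distantly related sequences.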
Exercise 2.2 Computing an optimal sequence alignment
- Two scoring schemes
- (1) Gap penalty -5, mismatch -1, match 3
- (2) Gap penalty -1, mismatch -1, match 3
- (1) First alignment score: 5 × 3 + 2 × (-1) = 13
- Second/third alignment score: 6 × 3 + 2 × (-5) = 8
- (2) First alignment score: 5 × 3 + 2 × (-1) = 13
- Second/third alignment score: 6 × 3 + 2 × (-1) = 16
- A more serious problem: identifying the wrong alignment
Exercise 2.2 Computing an optimal sequence alignment
Gap penalty -5
Gap penalty -1
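The arithmetic in Exercise 2.2 can be checked with a few lines. The sketch takes the counts from the slide (first alignment: 5 matches and 2 mismatches; second/third alignment: 6 matches and 2 gaps) and scores them under both schemes. The function name is illustrative.

```python
def score_from_counts(matches, mismatches, gaps, match=3, mismatch=-1, gap=-5):
    """Total alignment score from the number of matches, mismatches and gaps."""
    return matches * match + mismatches * mismatch + gaps * gap


for gap in (-5, -1):
    first = score_from_counts(5, 2, 0, gap=gap)
    second = score_from_counts(6, 0, 2, gap=gap)
    best = "first" if first > second else "second/third"
    print(f"gap {gap}: first = {first}, second/third = {second}, best = {best}")
```

Changing only the gap penalty flips which alignment scores highest, which is why a poorly chosen penalty can select the wrong alignment.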
Emerging Sequencing Methods
- Costs of genome sequencing
- Mid-2000s: $30-50 million to sequence a mammalian genome
- Target: $1000 per human genome by the year 2010
- J. Craig Venter Science Foundation: $500,000 award for the first person to achieve this goal
- New technologies
- Sequencing by hybridization (SBH): detects whether or not an exact match is present in a sample of DNA
- Mass spectrometric techniques: ionized fragments, time of flight
- Nanopore sequencing strategies: ultrafast and relatively inexpensive sequencing of long DNA fragments
- Single-molecule approaches: Solexa, Visigen and Genovoxx
- Single-molecule polony sequencing
Figure 2.6 Single-molecule polony sequencing
Emerging Sequencing Methods
A dilute solution of DNA is plated onto a glass microscope slide. In situ PCR produces thousands of tiny colonies of DNA (polonies, i.e. PCR colonies), which incorporate single dye-labeled dNTPs. The slide is read after each cycle of incorporation of a new base, allowing short sequences to be determined. Each numbered polony produces a short 20-25 nucleotide sequence, as shown. These can then be assembled computationally into a contiguous sequence.
Figure 2.7 (Part 1) Hierarchical versus shotgun sequencing
Genome Sequencing
- Whole genome sequences are assembled from hundreds of thousands (10^5s) of fragments, each typically between 500 and 1000 bp in length.
- Two general approaches for fragmentation and assembly: (1) hierarchical sequencing, (2) shotgun sequencing
- For a historical overview, see http://www.sciencemag.org/feature/plus/sfg/human/timeline1.shtml
- Hierarchical sequencing: first develop a low-resolution physical map, so that the sequence is obtained in large ordered pieces.
- Shotgun sequencing: break the genome into small fragments and use computer algorithms to assemble them, see Figure 2.7
- Most new genome projects adopt the shotgun approach.
Genome Sequencing: hierarchical sequencing
- Top-down, map-based or clone-by-clone strategy
- Late 1980s
- Genome → break into small fragments
- The relative locations of the fragments are known BEFORE sequencing
- Advantages
- It fostered (helped develop) the assembly of high-resolution physical and genetic maps
- It allowed the work to be divided among groups around the globe
- Technology for cloning large fragments of genomes progressed rapidly throughout the 1990s, for genomes such as E. coli, S. cerevisiae, C. elegans and A. thaliana.
- Top-down sequencing → clone sequences as manageable units of fragments (50-200 kb in length)
- Cloning vectors: BAC (300 kb), PAC (100 kb), phage-derived cosmids
Figure 2.7 (Part 2) Shotgun sequencing
Genome Sequencing: Shotgun sequencing
In the shotgun approach, no attempt is made to order the clones in advance. Instead, the whole genome is assembled using computer algorithms that order contigs based on their overlapping sequences.
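To make the idea of ordering fragments by their overlaps concrete, here is a toy greedy assembler. It is a sketch under strong assumptions (error-free reads, all on the same strand, no repeats) and is not the algorithm used by any real genome assembler. The reads are hypothetical fragments of the example strand from the Sanger slides.

```python
def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of a that equals a prefix of b (0 if below min_len)."""
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the longest exact overlap."""
    contigs = list(reads)
    while True:
        best_size, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


reads = ["CACGTATC", "TATCGAACA", "TAGCCACG"]  # unordered, hypothetical reads
print(greedy_assemble(reads))
```

Repeats break this simple scheme, because a read ending in a repeat overlaps equally well with several other reads. That is why real assemblers mask repeats and use mate-pair information.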
Figure 2.8 Cloning vectors used in genome
sequencing
Cloning vectors used in genome sequencing
Genome Sequencing: hierarchical sequencing
- DNA libraries
- Fragment the DNA by restriction enzyme (RE) digestion or sonication
- Fragments are ligated into a multiple cloning site (MCS) in the vector
- Aim for 5- to 10-fold redundancy → the library contains 5 to 10 times as much DNA as the genome
- Each clone will have different ends → it is possible to select a scaffold of clones that forms a contiguous sequence coverage, a tiling path
- By aligning the regions of overlap (Fig. 2.9)
- The tiling path can be assembled using a combination of 3 methods: (1) hybridization, (2) fingerprinting, and (3) end-sequencing
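The redundancy target translates directly into a library size. The sketch below computes how many clones give a chosen fold coverage. It also uses the standard Poisson approximation, which is not stated on the slides, to estimate the fraction of the genome expected to be present in at least one clone. The genome and insert sizes in the example are hypothetical.

```python
import math


def clones_needed(genome_bp: int, insert_bp: int, fold_coverage: float) -> int:
    """Number of clones whose inserts add up to fold_coverage genome equivalents."""
    return math.ceil(fold_coverage * genome_bp / insert_bp)


def expected_fraction_covered(fold_coverage: float) -> float:
    """Poisson estimate of the fraction of the genome present in at least one clone."""
    return 1.0 - math.exp(-fold_coverage)


genome_bp, insert_bp = 100_000_000, 100_000  # hypothetical 100 Mb genome, 100 kb inserts
for fold in (1, 5, 10):
    n = clones_needed(genome_bp, insert_bp, fold)
    print(f"{fold}x: {n} clones, about {expected_fraction_covered(fold):.2%} of the genome represented")
```

Under this approximation 1-fold coverage leaves more than a third of the genome unrepresented, while 5- to 10-fold coverage leaves well under 1%.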
Figure 2.9 Hierarchical assembly of a sequence-contig scaffold (supercontig)
Genome Sequencing: hierarchical sequencing
- A minimal tiling path through a library of aligned BAC clones that ensures complete coverage of the chromosome is chosen.
- After sequencing independent shotgun libraries for each BAC, small gaps in the sequenced clone contigs remain.
- These are closed as far as possible by merging the two BAC sequences, as well as by the addition of mate-pair information (yellow) and cDNA structural information (red), which establishes the orientation and distance between cloned segments.
Genome Sequencing: hierarchical sequencing
- Hybridization
- All of the clones in a library that carry a particular sequence can be identified rapidly by hybridizing a small radioactively or chemically labeled probe containing the sequence to a filter on which is printed an array of 10,000s of clones (Fig. 2.10A)
- Fingerprinting
- Study the restriction enzyme (RE) patterns
- Contigs of large-insert clones are assembled by comparing and aligning the clones according to their RE patterns
- An RE with a 6-bp recognition site cuts on average once every 4^6 = 2^12 = 4096 ≈ 4000 bp
- For a 100-kb BAC → 100 kb / 4 kb ≈ 25, i.e. 20-30 fragments
- These fragments can be separated by electrophoresis → fingerprint profile → BAC alignment by gel → software alignment → overlaps → contigs → assembly of Mb-length contigs
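The fragment-size arithmetic above can be checked with a short calculation. It assumes the four bases are equally frequent and independent, which real genomes only approximate; the function names are illustrative.

```python
def expected_fragment_length(site_length: int) -> int:
    """Average spacing between cut sites for a recognition site of the given length."""
    return 4 ** site_length


def expected_fragment_count(insert_bp: int, site_length: int) -> float:
    """Approximate number of restriction fragments produced from one insert."""
    return insert_bp / expected_fragment_length(site_length)


print(expected_fragment_length(6))  # 6-bp cutter
print(round(expected_fragment_count(100_000, 6)))  # 100-kb BAC insert
```

For a 6-bp cutter and a 100-kb BAC this gives about 24 fragments, consistent with the 20-30 bands quoted on the slide.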
Figure 2.10 Aligning BAC clones by hybridization and fingerprinting
Genome Sequencing: hierarchical sequencing
- (A) A macroarray of BAC clones is probed with a short, radioactive fragment to identify all BACs that carry a specific fragment.
- These clones are digested with an RE, end-labeled, and separated by gel electrophoresis.
- Software converts the bands to a virtual profile, shown hypothetically for a small portion of four bands (highlighted box in part B). Shared bands (red or blue) imply that the two clones share the same sequence. Green indicates the vector band common to all clones.
- The fingerprint profile is then converted into a BAC alignment. In this example, clone 2 does not share any bands with the others and so is placed into a separate BAC contig, while the other three clones form a tiling path.
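The grouping step in this figure can be sketched as follows: clones that share at least one non-vector band are placed in the same contig. The band sizes are hypothetical, and exact size matching is a simplifying assumption, since real fingerprinting software must tolerate measurement error in band sizes.

```python
def build_contigs(fingerprints: dict[str, set[int]], vector_band: int, min_shared: int = 1):
    """Group clones into contigs by linking clones that share non-vector bands."""
    unassigned = set(fingerprints)
    contigs = []
    while unassigned:
        start = min(unassigned)  # pick clones in a deterministic order
        unassigned.remove(start)
        group, frontier = {start}, [start]
        while frontier:
            current = frontier.pop()
            for other in sorted(unassigned):
                shared = (fingerprints[current] & fingerprints[other]) - {vector_band}
                if len(shared) >= min_shared:
                    unassigned.remove(other)
                    group.add(other)
                    frontier.append(other)
        contigs.append(sorted(group))
    return contigs


# Hypothetical band sizes in bp; 7000 plays the role of the common vector band.
fingerprints = {
    "clone1": {7000, 4100, 2500, 900},
    "clone2": {7000, 5200, 3300, 1200},
    "clone3": {7000, 4100, 1800, 600},
    "clone4": {7000, 1800, 1500, 600},
}
print(build_contigs(fingerprints, vector_band=7000))
```

As in the figure, clone 2 shares only the vector band with the others and ends up in its own contig, while clones 1, 3 and 4 are linked through their shared bands.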
Genome Sequencing: hierarchical sequencing
- End-sequencing
- Fill in the gaps after fingerprinting. How?
- Sequence both ends of the collection of BAC clones
- Once a critical threshold of sequences has been reached → overlaps
- For example, along a 10 Mbp genome, the end sequences of 10,000 BAC clones provide a sequence tag every 5 kb (for 5-fold coverage)
- Along a 10 Mbp genome → 10 Mbp / 10,000 BACs → 1 kbp per BAC
- Five-fold → 10 Mb / 2000 BACs = 5 kb (the sequence-tag distance)
- Given this tag density, it is possible to close gaps < 50 kb
- Once the tiling path is chosen → shotgun the BAC clones into small fragments
- Subcloning: use M13 phagemid (1 kb; exists as dsDNA and ssDNA)
- or clone 2-3 kb fragments into a plasmid vector
Genome Sequencing: Shotgun sequencing
- Use computer algorithms to assemble the sequences (100,000s)
- About 5- to 10-fold redundancy for each fragment
- Library: from a single whole genome
- After MSA → screen out repetitive sequences, overlap reads of the same sequence → generate unitigs and scaffolds → >90% of the sequences are assembled
- Finishing phase: closing gaps, cleaning up ambiguities → takes as much time as the shotgun phase
- Users are asked to trust the assemblies
- Celera Genomics used the following software to assemble the sequences
- Screener: masks (does not remove) sequences that contain repetitive DNA (such as microsatellites, LINEs, Alu repeats, retrotransposons and ribosomal DNA)
- Overlapper: compares every unscreened read against every other unscreened read, searching for overlaps of a predetermined length and identity.
- Parallel processing on 40 supercomputers, each with 4 GB RAM, allowed the 27 M screened human sequence reads to be overlapped in < 5 days!
- Repeat-induced overlaps of a sequence are resolved using the Unitigger (see Figure 2.11).
- Scaffolder: uses mate-pair information to link U-unitigs into scaffold contigs
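The Overlapper's criterion, an overlap of at least a predetermined length and identity, can be illustrated with a simplified check. This sketch is not Celera's implementation: it compares only the suffix of one read with the prefix of another, allows mismatches but no indels, and the thresholds and reads in the example are hypothetical.

```python
def best_overlap(a: str, b: str, min_len: int = 40, min_identity: float = 0.94):
    """Return (length, identity) of the longest suffix(a)/prefix(b) overlap that
    meets both thresholds, or None if no overlap qualifies."""
    if min_len < 1:
        raise ValueError("min_len must be at least 1")
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        suffix, prefix = a[-size:], b[:size]
        matches = sum(x == y for x, y in zip(suffix, prefix))
        identity = matches / size
        if identity >= min_identity:
            return size, identity
    return None


# Toy reads with deliberately low thresholds so the example stays short.
print(best_overlap("ACGTACGTTT", "CGTTAGGA", min_len=4, min_identity=0.75))
print(best_overlap("ACGTACGT", "TTTT", min_len=2, min_identity=0.75))
```

Comparing every read against every other read this way scales with the square of the number of reads, which helps explain why overlapping 27 million reads needed the parallel hardware described above.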
Genome Sequencing: Shotgun sequencing
- Figure 2.11
- (A) Sequence alignment between two or more shotgun clones can arise between unique sequences (left) or repetitive sequences (right).
- (B) The Overlapper aligns unitigs, which are identified as unique sequence alignments (U-unitigs) or overcollapsed repeats (blue).
- Two contigs can be aligned and oriented by using mate-pair sequence information from the ends of longer (10- or 50-kb) clones, as shown at the bottom, while mate-pairs from 2-kb fragments allow assembly of scaffolds despite the presence of simple repeats such as microsatellites (blue) that are masked before performing alignments.
Figure 2.11 U-unitigs and repeat resolution
Genome Sequencing: Shotgun sequencing
Figure 2.12 shows the estimated coverage of the fly and human whole genomes after initial assembly. In both cases, 84% or more of the genome was covered by scaffolds at least 100 kb in length, while most scaffolds were in the Mb range. Increasing sequence coverage from 5x to 10x → a 10% increase in the proportion of scaffolds of lengths up to 1 Mb. The plot shows the percentage of scaffolds that have a length greater than that indicated for the fly 10x, human 8x (CSA) and human 5x (whole genome assembly, WGA) sequences generated by Celera. The fly and CSA assemblies include shredded sequences generated from BAC clones by public genome sequencing efforts.
Figure 2.12 Proportion of fly and human genomes
in large scaffolds
- NCTS: http://math.cts.nthu.edu.tw/Mathematics/conference-PT2005.html
- UCSD
- http://research.calit2.net/recomb-workshop05/