Title: Genome Sequencing and Annotation
1 Genome Sequencing and Annotation
2 - The first objective of most genome projects is to determine the DNA sequence, either of the genome or of a large number of transcripts.
3 Automated DNA Sequencing
- The principle of Sanger sequencing
4 ddNTP
6 Isotope label: 33P or 35S
7 - Bands corresponding to molecules that differ in length by one nucleotide are separated by 0.5 mm.
- In general, only up to 500 bases could be read from a single set of four lanes on a 30-cm gel.
- It is difficult to sequence GC-rich regions.
8 High-throughput Sequencing
9 - Four-color fluorescent dyes have replaced the radioactive label.
- Rather than stopping the electrophoresis at a particular time, the products are scanned for laser-induced fluorescence just before they run off the end of the electrophoresis medium.
- Reads greater than 1200 bp in a run are possible with current technology, though the 500-900 bp range is more common.
- Capillary electrophoresis
10 Anion carrier
11 ABI 3700 Sequencer
Twenty 96-well plates are handled per day, so 2 Mb of sequence can be determined.
12 Reading Sequence Traces
- The reading of raw sequence traces, or base-calling, is now routinely performed by automated software that reads bases, aligns similar sequences, and provides an intuitive platform for editing.
- phred (free software)
- http://www.phrap.org/
14 2.2 The phred base-calling algorithm
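phred reports a quality value for each called base, defined as Q = -10 log10(P), where P is the estimated probability that the call is wrong. The sketch below only illustrates that conversion; the function names are made up for this example and are not part of phred itself.

```python
import math


def phred_quality(p_error):
    """Convert an estimated base-call error probability to a quality value Q."""
    return -10 * math.log10(p_error)


def error_probability(quality):
    """Convert a quality value Q back to an error probability."""
    return 10 ** (-quality / 10)


# A base called with a 1-in-1000 chance of error has a quality value of about 30.
print(round(phred_quality(0.001)))
```

A quality of 20 corresponds to a 1% error rate and a quality of 30 to 0.1%, so each step of 10 is a tenfold drop in the estimated error probability.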
15 2.3 Automated sequence chromatograms
(Figure labels: < 50 bp; 5' → 3'; > 800 bp)
16 Contig Assembly
Optimizing the sequence alignment to obtain the highest score
17 Box 2.1 Pairwise Sequence Alignment
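As a rough illustration of finding the alignment with the highest score, here is a minimal global-alignment (Needleman-Wunsch style) scoring sketch. The scoring values (match +1, mismatch -1, gap -2) are assumptions chosen for the example; they are not taken from Box 2.1, and phrap uses its own quality-aware scoring.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return the best global alignment score of sequences a and b.

    The scoring parameters are illustrative assumptions.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # Aligning a prefix against nothing costs one gap per base.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    # Each cell takes the best of: align two bases, or open a gap in either sequence.
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
    return score[-1][-1]


print(global_alignment_score("ACGT", "AGT"))
```

With these parameters, aligning ACGT against AGT gives three matches and one gap, for a score of 1.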
18 2.5 An aligned-reads window in consed
Phrap assembler and consed graphic editor
http://www.phrap.org/
19 Several sequencing technologies have been proposed to reduce the cost.
20 Emerging Sequencing Methods
- Sequencing by hybridization (SBH)
- Mass spectrometric techniques
- Nanopore sequencing strategies
- Single-molecule Sanger sequencing
- Polony sequencing
22 Mass spectrometric techniques
23 Mass spectrometric techniques: DNA sequencing uses the same principle
24 Nanopore sequencing strategies
25 http://arep.med.harvard.edu/Polonator/chem/
A
C
G
U
26 2.6 Single-molecule polony sequencing
http://arep.med.harvard.edu/Polonator/
27 Genome Sequencing
- Whole chromosome sequences are reassembled from the sequences of hundreds of thousands of fragments, each typically between 500 and 1000 bp in length.
- Two general strategies
- 1. Hierarchical sequencing
- 2. Shotgun sequencing
28 2.7 (Part 1) Hierarchical versus shotgun sequencing
29 2.7 (Part 2) Hierarchical versus shotgun sequencing
30 Hierarchical sequencing
- Also known as the top-down, map-based, or clone-by-clone strategy, developed in the late 1980s.
- The first step is to clone the genome as manageable units of some 50-200 kb in length, whose relative locations are known.
- DNA libraries are constructed by partial digestion or shearing of genomic DNA, and the fragments are ligated into a cloning site using standard recombinant DNA procedures.
31 The copy number is close to one per cell
32 2.8 Cloning vectors used in genome sequencing
33 2.9 Hierarchical assembly of a sequence-contig scaffold (supercontig): tiling path
5- to 10-fold redundancy
- Assembly methods
- 1. Hybridization (A)
- 2. Fingerprinting (B, C, D)
- 3. End-sequencing
34 2.10 Aligning BAC clones by hybridization and fingerprinting
35 End-sequencing
36 Shotgun sequencing
- Sequence fragments with 5- to 10-fold redundancy are generated from a plasmid library that is constructed from a single whole genome.
- The contig scaffolds were produced by Celera Genomics using computer algorithms such as Screener, Overlapper, and Unitigger.
37 2.11 U-unitigs and repeat resolution
Overlaps of at least 40 bp with no more than 6 differences were accepted.
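The acceptance rule for overlaps can be sketched as a simple suffix-prefix comparison. This is only an illustration, not Celera's Overlapper: the function name is hypothetical, and "differences" is assumed here to mean substitutions only (insertions and deletions are ignored).

```python
def best_overlap(left, right, min_len=40, max_diff=6):
    """Find the longest suffix of `left` that matches a prefix of `right`.

    An overlap is accepted when it is at least `min_len` bases long and has
    no more than `max_diff` mismatching positions (assumption: substitutions
    only). Returns (overlap_length, differences), or None if nothing qualifies.
    """
    longest = min(len(left), len(right))
    # Try the longest candidate first so the first hit is the best overlap.
    for length in range(longest, min_len - 1, -1):
        suffix = left[-length:]
        prefix = right[:length]
        diffs = sum(1 for x, y in zip(suffix, prefix) if x != y)
        if diffs <= max_diff:
            return length, diffs
    return None


# Toy reads with relaxed thresholds so the example is easy to follow by eye.
print(best_overlap("TTTTACGT", "ACGTGGGG", min_len=4, max_diff=0))
```

The defaults mirror the 40 bp and 6-difference thresholds; real reads would be compared with those defaults rather than the toy values used in the call above.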
38 2.12 Proportion of fly and human genomes in large scaffolds
39 Sequence Verification
- Completeness
- Accuracy
- Validity of assembly
40 2.13 Alignment of two draft human genome assemblies
41 Genome Annotation
42 EST Sequencing
43 EST
44 2.14 (Part 1) Relationship between gene structure, cDNA, and EST sequences
Alternative splicing
45 - Alternative splicing of mRNA
- Splice pair
GU donor site
AG acceptor site
Exon Enhancer Motif
46 - Four Modes of Alternative Splicing
Splice / Don't Splice
Competing 5' or 3' Splice Sites
47 Exon Skipping
Mutually Exclusive Exons
48 Ab Initio Gene Discovery
Genes are predicted from sequence patterns.
(Figure: gene structure diagram, 5' to 3'. Labels: flanking regions, GC box, CAAT box, TATA box, TSS, initiation codon, GT and AG splice sites, stop codon, AATAAA signal, poly(A)-addition site)
49 - Statistics-based (ab initio) Methods
- Ab initio approaches predict genes directly from the computational properties of exons, introns, and other features in the genomic sequence, without reference to experimental data.
- (1) Hidden Markov Model
- GenScan, Genie, Genemark, Veil, HMMgene
- (2) Neural Networks
- Grail II, GrailEXP_Perceval
- (3) Decision Tree
- MZEF, MZEF-SPC
- (4) Integration of Various Statistical Approaches
- FGENESH
50 Example: Two dice
- One die is fair, producing 1, 2, 3, 4, 5, 6 with equal probability. The other is loaded and produces 6 with higher probability than the other numbers.
- Each time, only one die is rolled, and it produces a number in the set {1, 2, 3, 4, 5, 6} that can be observed. But we do not know which of the two dice was rolled.
- The transitions between the fair die and the loaded die follow the Markov property.
States: S1 = fair, S2 = loaded
Transitions: fair → fair 0.9, fair → loaded 0.1, loaded → loaded 0.9, loaded → fair 0.1
Emissions (fair): 1: 1/6, 2: 1/6, 3: 1/6, 4: 1/6, 5: 1/6, 6: 1/6
Emissions (loaded): 1: 1/10, 2: 1/10, 3: 1/10, 4: 1/10, 5: 1/10, 6: 1/2
Observed rolls: 6 4 5 3 1 3 2 6 5 1 4 5 6 3 6 6 6 4 6 3 1 6 5 6 6 6 2 4 5 3
Hidden states:  F F F F F F F F F F F F L L L L L L L L L L L L L L F F F F
51 Elements of a hidden Markov model
- An HMM is characterized by the following elements:
- N, the number of hidden states: S_1, S_2, ..., S_N
- (N = 2: fair die, loaded die)
- M, the number of distinct observations: V_1, V_2, ..., V_M
- (M = 6: 1, 2, 3, 4, 5, 6)
- Transition probabilities between the hidden states
- a_ij = P(S_j | S_i is the previous state) for each i, j from 1 to N.
- (P(F→F) = 0.9, P(F→L) = 0.1, P(L→L) = 0.9, P(L→F) = 0.1)
- Emission probability of each observation for each hidden state: b_i(k) = P(V_k | S_i)
- (b_Fair(k) = 1/6 for k = 1, 2, 3, 4, 5, 6; b_Loaded(k) = 1/10 for k = 1, 2, 3, 4, 5 and b_Loaded(6) = 1/2)
- Initial probabilities: the probability of each state being the first state.
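Given these elements, one common way to guess which die produced each roll is the Viterbi algorithm, which finds the single most probable sequence of hidden states. The sketch below uses the transition and emission values of the two-dice example; the initial probabilities are not given on the slides, so 0.5 for each state is an assumption.

```python
import math

STATES = ("F", "L")  # F = fair die, L = loaded die
START = {"F": 0.5, "L": 0.5}  # assumption: initial probabilities are not given
TRANS = {"F": {"F": 0.9, "L": 0.1}, "L": {"F": 0.1, "L": 0.9}}
EMIT = {
    "F": {k: 1 / 6 for k in range(1, 7)},
    "L": {**{k: 1 / 10 for k in range(1, 6)}, 6: 1 / 2},
}


def viterbi(rolls):
    """Return the most probable state string (e.g. 'FFLL') for a non-empty list of rolls."""
    # best[s] = log-probability of the best path so far that ends in state s
    best = {s: math.log(START[s]) + math.log(EMIT[s][rolls[0]]) for s in STATES}
    back = []  # one dict per later roll: state -> best previous state
    for roll in rolls[1:]:
        prev, best, pointers = best, {}, {}
        for s in STATES:
            p = max(STATES, key=lambda r: prev[r] + math.log(TRANS[r][s]))
            pointers[s] = p
            best[s] = prev[p] + math.log(TRANS[p][s]) + math.log(EMIT[s][roll])
        back.append(pointers)

    # Trace back from the best final state.
    state = max(STATES, key=lambda s: best[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))


rolls = [6, 4, 5, 3, 1, 3, 2, 6, 5, 1, 4, 5, 6, 3, 6, 6, 6, 4, 6, 3,
         1, 6, 5, 6, 6, 6, 2, 4, 5, 3]
print(viterbi(rolls))
```

Log-probabilities are used so that long roll sequences do not underflow. The decoded path is an estimate and need not match the true hidden states roll for roll.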
52 Modeling Exon-Intron Segments
- We can treat the exon and intron segments of a DNA sequence as two distinct hidden states, each of which emits A, T, C, G with its own probabilities.
States: S1 = exon, S2 = intron
Transitions: exon → exon 0.9, exon → intron 0.1, intron → intron 0.9, intron → exon 0.1
Emissions (exon): A 1/6, T 1/6, C 1/3, G 1/3
Emissions (intron): A 1/4, T 1/4, C 1/4, G 1/4
53 Modeling Exon-Intron Segments
- Number of hidden states: 2 (exon state and intron state)
- Number of observations: 4 (A, T, C, G)
- Transition probabilities
- P(S1→S2) = 0.1
- P(S1→S1) = 0.9
- P(S2→S1) = 0.1
- P(S2→S2) = 0.9
- Emission probabilities
- S1: P(A) = P(T) = 1/6
- P(C) = P(G) = 1/3
- S2: P(A) = P(T) = P(C) = P(G) = 1/4
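With the exon-intron model specified this way, the probability of observing a given DNA sequence, summed over every possible exon/intron labeling, can be computed with the forward algorithm. The sketch below is a minimal illustration; the initial probabilities (0.5 each) are an assumption, since the slides do not state them.

```python
START = {"exon": 0.5, "intron": 0.5}  # assumption: not given on the slides
TRANS = {
    "exon": {"exon": 0.9, "intron": 0.1},
    "intron": {"exon": 0.1, "intron": 0.9},
}
EMIT = {
    "exon": {"A": 1 / 6, "T": 1 / 6, "C": 1 / 3, "G": 1 / 3},
    "intron": {base: 1 / 4 for base in "ATCG"},
}


def sequence_likelihood(seq):
    """Forward algorithm: P(seq) summed over all hidden state paths (seq non-empty)."""
    # alpha[s] = P(bases seen so far, current state is s)
    alpha = {s: START[s] * EMIT[s][seq[0]] for s in START}
    for base in seq[1:]:
        alpha = {
            s: sum(alpha[r] * TRANS[r][s] for r in alpha) * EMIT[s][base]
            for s in START
        }
    return sum(alpha.values())


print(sequence_likelihood("AC"))
```

For long sequences the raw probabilities become very small, so a practical version would rescale alpha at each step or work in log space.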
54 Neural Networks
55 - Homology-based Methods
- Homology-based approaches identify genes with the aid of experimental data. They exploit alignments between genomic sequence and known cDNA (or protein) databases.
- (1) Local Alignment Methods (BLAST-based)
- AAT, GAIA, INFO, TAP
- (2) Pattern-based Alignment Method
- Flash, ICE
56 - Combination Tools
- These tools combine sequence similarity and ab initio gene-finding approaches. They predict genes by producing a spliced alignment between a genomic sequence and a candidate amino acid sequence.
- Procrustes, GeneWise, GenomeScan
- FGENESH+ and FGENESH++
- GrailEXP_Gawain and _GALAHAD
57 Evaluate accuracy
(Figure labels: TPP, TPE, FP, FN, TN)
TP (True Positive): correctly predicted as coding
TN (True Negative): correctly predicted as noncoding
FP (False Positive): noncoding → coding
FN (False Negative): coding → noncoding
58 - Sensitivity
- Sn = TPE/(TP + FN)
- Specificity
- Sp = TPE/(TP + FP)
- Missing exon
- ME = (number of missing exons)/(number of actual exons)
- Wrong exon
- WE = (number of wrong exons)/(number of predicted exons)
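The four ratios can be computed directly from counts. In the sketch below the numerator of Sn and Sp is taken to be the true-positive count, which is an assumption about what TPE denotes on the slide; the function name and the counts in the example call are invented for illustration.

```python
def exon_accuracy(tp, fp, fn, missing_exons, actual_exons,
                  wrong_exons, predicted_exons):
    """Compute Sn, Sp, ME, and WE from raw counts.

    Assumption: the slide's TPE is treated as the true-positive count `tp`.
    """
    return {
        "Sn": tp / (tp + fn),  # fraction of real coding features that were found
        "Sp": tp / (tp + fp),  # fraction of predictions that are correct
        "ME": missing_exons / actual_exons,
        "WE": wrong_exons / predicted_exons,
    }


# Hypothetical counts, for illustration only.
print(exon_accuracy(tp=80, fp=20, fn=20, missing_exons=5, actual_exons=50,
                    wrong_exons=10, predicted_exons=40))
```

Note that Sp as defined here is the proportion of predictions that are correct, so it does not use the true-negative count.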
59 Accuracy of exon prediction
- Statistics-based (ab initio) Methods
- Sensitivity/Specificity: 30% to 70%
- Homology-based Methods
- Sensitivity/Specificity: 99%
- when related protein sequences are available
- Combination Tools
- ????????
60 Regulatory Sequences
Phylogenetic footprinting: in principle, alignment of two sequences followed by discovery of conserved patterns.
61 2.15 Phylogenetic shadowing
Uncovers conserved segments using multiple sequence alignment
Candidate regulatory sequence
Greater diversity: non-regulatory sequence
62 Non-Protein Coding Genes
- It is difficult to identify non-protein-coding genes:
- Most of the transcripts are not polyadenylated, and so are missing from cDNA libraries.
- They are conserved at the level of secondary structure, with poor sequence constraint.
- Relatively little is known about their function and distribution.
63 Structural Features of Genome Sequences
- Repetitive sequences (five classes)
- GC content
- Simple sequence repeats
- Segmental duplication
- Structure of centromeres and telomeres
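Of these features, GC content is the simplest to compute. The sketch below measures the GC fraction of a whole sequence and of fixed-size windows along it; the function names and the window sizes are illustrative choices, not values from the slides.

```python
def gc_content(seq):
    """Fraction of bases in a non-empty sequence that are G or C."""
    seq = seq.upper()
    gc = sum(1 for base in seq if base in "GC")
    return gc / len(seq)


def windowed_gc(seq, window, step):
    """GC fraction of each full window of length `window`, moving by `step`."""
    return [
        gc_content(seq[i:i + window])
        for i in range(0, len(seq) - window + 1, step)
    ]


print(windowed_gc("ATATGCGC", window=4, step=4))  # [0.0, 1.0]
```

Scanning a genome in windows like this is one way to see how GC content varies along a chromosome; real analyses use much larger windows than the toy value here.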
64 The ENCODE Project
- Only 5% of mammalian genome sequence is highly conserved.
- At most, only 2% of a typical mammalian genome encodes transcripts.
- The function of most non-genic sequence is unknown.
66 Clusters of Orthologous Genes (COGs)
paralog
ortholog
67 http://www.ncbi.nlm.nih.gov/COG/
68 Gene Ontology
70 http://www.geneontology.org/