Genome Sequencing and Annotation

Slides: 71

Provided by: asia1

Transcript and Presenter's Notes

1
Genome Sequencing and Annotation
2

The first objective of most genome projects is to
determine the DNA sequence either of the genome
or of a large number of transcripts.

3
Automated DNA Sequencing

The Principle of Sanger Sequencing

4
ddNTP
5
(No Transcript)
6
Isotope label 33P or 35S
7

Bands corresponding to molecules that differ in
length by one nucleotide are separated by 0.5 mm.
In general, only up to 500 bases could be read
from a single set of four lanes on a 30-cm gel.
GC-rich regions are difficult to sequence.

8
High-throughput Sequencing
9

Four-color fluorescent dyes have replaced the
radioactive label.
Rather than stopping the electrophoresis at a
particular time, the products are scanned for
laser-induced fluorescence just before they run
off the end of the electrophoresis medium.
Reads greater than 1200 bp in a run are possible
with current technology, though the 500-900 bp
range is more common.
Capillary electrophoresis

10
Anion carrier
11
ABI 3700 Sequencer
Twenty 96-well plates are handled per day, so 2 Mb
of sequence can be determined per day.
12
Reading Sequence Traces

The reading of raw sequence traces, or
base-calling, is now routinely performed using
automated software that reads bases, aligns
similar sequences, and provides an intuitive
platform for editing.
phred (free software): http://www.phrap.org/

13
(No Transcript)
14
2.2 The phred base-calling algorithm
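
phred attaches a quality value to each called base. By the standard definition, Q = -10 log10(P), where P is the estimated probability that the call is wrong. The following is a minimal sketch of that conversion only; the function names are illustrative and are not part of phred itself.

```python
import math

def phred_to_error_prob(q):
    """Error probability implied by a Phred quality value Q."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred quality value implied by an error probability P."""
    return -10 * math.log10(p)

# Each step of 10 in Q corresponds to a tenfold drop in error probability.
for q in (10, 20, 30):
    print(q, phred_to_error_prob(q))
```

A value of Q = 20 therefore corresponds to one expected error per 100 called bases.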
15
2.3 Automated sequence chromatograms
< 50 bp
5' → 3'
> 800 bp
16
Contig Assembly
Optimizing sequence alignment with highest score
17
Box 2.1 Pairwise Sequence Alignment
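
Finding the alignment with the highest score is usually done by dynamic programming. The sketch below computes a global (Needleman-Wunsch) alignment score; the scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for illustration, not the parameters used by phrap.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best global alignment score of a and b (Needleman-Wunsch)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per character.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap      # gap in b
            left = score[i][j - 1] + gap    # gap in a
            score[i][j] = max(diag, up, left)
    return score[-1][-1]

print(global_alignment_score("ACGT", "AGT"))
```

Keeping a traceback matrix alongside the scores would recover the alignment itself rather than only its score.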
18
2.5 An aligned-reads window in consed
Phrap assembler and consed graphic editor
http://www.phrap.org/
19
Several sequencing technologies have been proposed
to reduce the cost.
20
Emerging Sequencing Methods

Sequencing by hybridization (SBH)
Mass spectrophotometric techniques
Nanopore sequencing strategies
Single-molecule Sanger sequencing
Polony sequencing

21
(No Transcript)
22
Mass spectrophotometric techniques
23
Mass spectrophotometric DNA sequencing
techniques use the same principle
24
Nanopore sequencing strategies
25
http://arep.med.harvard.edu/Polonator/chem/
A
C
G
U
26
2.6 Single-molecule polony sequencing
http://arep.med.harvard.edu/Polonator/
27
Genome Sequencing

Whole chromosome sequences are reassembled from
the sequences of hundreds of thousands of
fragments, each typically between 500 and 1000 bp
in length.
Two general strategies
1. Hierarchical sequencing
2. Shotgun sequencing

28
2.7 (Part 1) Hierarchical versus shotgun
sequencing
29
2.7 (Part 2) Hierarchical versus shotgun
sequencing
30
Hierarchical sequencing

Also known as top-down, map-based, or
clone-by-clone strategy, developed in the late
1980s.
The first step is to clone the genome as
manageable units of some 50-200 kb in length,
whose relative locations are known.
DNA libraries are constructed by partial
digestion or shearing of genomic DNA, and the
fragments are ligated into the cloning site of a
vector using standard recombinant DNA procedures.

31
The copy number is close to one per cell
32
2.8 Cloning vectors used in genome sequencing
33
2.9 Hierarchical assembly of a sequence-contig
scaffold (supercontig) --- tiling path
5- to 10-fold redundancy

Assembly methods
1. Hybridization (A)
2. Fingerprinting (B, C, D)
3. End-sequencing

34
2.10 Aligning BAC clones by hybridization and
fingerprinting
35
End-sequencing
36
Shotgun sequencing

Sequence fragments with 5- to 10-fold redundancy
are generated from a plasmid library
constructed from a single whole genome.
Celera Genomics produced the contig scaffolds
using computer algorithms such as
Screener, Overlapper, and Unitigger.
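
Redundancy here means the average number of times each base is sequenced: total bases read divided by genome size. As a rough sketch of what 5- to 10-fold redundancy implies, the following computes the number of reads required. The 100 Mb genome size in the example is a hypothetical value, not a figure from the slides.

```python
import math

def redundancy(num_reads, read_length_bp, genome_size_bp):
    """Average fold coverage: total sequenced bases / genome size."""
    return num_reads * read_length_bp / genome_size_bp

def reads_needed(genome_size_bp, read_length_bp, fold):
    """Number of reads needed to reach the requested fold redundancy."""
    return math.ceil(fold * genome_size_bp / read_length_bp)

# Hypothetical 100 Mb genome, 500 bp reads, 10-fold redundancy.
print(reads_needed(100_000_000, 500, 10))
```

This is an average only; because reads land at random, some regions are covered more deeply and others not at all.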

37
2.11 U-unitigs and repeat resolution
Overlaps of at least 40 bp with no more than 6%
differences were accepted.
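
The acceptance rule can be illustrated with a simplified overlap test between two reads. This sketch assumes the difference limit is a 6% mismatch rate and counts substitutions only (no insertions or deletions); it is not the Celera Overlapper, and `best_overlap` is a hypothetical helper name.

```python
def best_overlap(left, right, min_len=40, max_diff_rate=0.06):
    """Length of the longest suffix of `left` that matches a prefix of
    `right` within the mismatch limit, or 0 if none qualifies."""
    for length in range(min(len(left), len(right)), min_len - 1, -1):
        suffix = left[-length:]
        prefix = right[:length]
        mismatches = sum(x != y for x, y in zip(suffix, prefix))
        if mismatches <= max_diff_rate * length:
            return length
    return 0

# Two toy reads sharing a 50 bp segment.
core = "ACGT" * 12 + "AC"
print(best_overlap("A" * 10 + core, core + "T" * 10))
```

Comparing every pair of reads this way is quadratic in the number of reads, so real assemblers first index short exact matches to pick candidate pairs.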
38
2.12 Proportion of fly and human genomes in
large scaffolds
39
Sequence Verification

Completeness
Accuracy
Validity of assembly

40
2.13 Alignment of two draft human genome
assemblies
41
Genome Annotation
42
EST Sequencing
43
EST
44
2.14 (Part 1) Relationship between gene
structure, cDNA, and EST sequences
Alternative splicing
45

Alternative splicing of mRNA
Splice pair

GU donor site
AG acceptor site
Exon Enhancer Motif
46

Four Modes of Alternative Splicing

Splice / Don't Splice
Competing 5' or 3' Splice Sites
47
Exon Skipping
Mutually Exclusive Exons
48
Ab Initio Gene Discovery
Predict genes according to sequence patterns
[Figure: gene structure from 5' to 3'. 5' flanking region with GC box, CAAT box, GC box, and TATA box; TSS; initiation codon; exons separated by introns bounded by GT and AG; stop codon; AATAAA poly(A)-addition site; 3' flanking region]
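
As a toy illustration of pattern-based searching, the sketch below reports the positions of a few of these signals in a sequence. The exact motif strings (for example TATAAA for the TATA box) are assumed consensus forms; real signals vary, and practical gene finders score them probabilistically rather than by exact match.

```python
import re

# Assumed consensus strings, for illustration only.
SIGNALS = {
    "TATA box": "TATAAA",
    "initiation codon": "ATG",
    "stop codon": "TAA|TAG|TGA",
    "poly(A) signal": "AATAAA",
}

def find_signals(seq):
    """Map each signal name to the 0-based start positions of its matches."""
    hits = {}
    for name, pattern in SIGNALS.items():
        # A lookahead lets overlapping occurrences be reported too.
        hits[name] = [m.start() for m in re.finditer(f"(?=({pattern}))", seq)]
    return hits

print(find_signals("GGTATAAAGGATGCCCTGAGGAATAAAGG"))
```

Short motifs such as ATG occur by chance very often, which is why the signals are only informative in combination.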
49

Statistics-based (ab initio) Methods
The ab initio approach predicts genes directly
from the statistical properties of exons,
introns, and other features of the genomic
sequence, without reference to experimental
data.
(1) Hidden Markov Model: GenScan, Genie, Genemark, Veil, HMMgene
(2) Neural Networks: Grail II, GrailEXP_Perceval
(3) Decision Tree: MZEF, MZEF-SPC
(4) Integration of Various Statistical Approaches: FGENESH

50
Example: Two dice
States: S1 (fair), S2 (loaded)

One die is fair, producing 1, 2, 3, 4, 5, 6 with
equal probability. The other is loaded and
produces 6 with higher probability than
the other numbers.
Each time, only one die is rolled, and the number
it produces (from the set 1, 2, 3, 4, 5, 6) can be
observed. But we do not know which of the two
dice was rolled.
The transitions between the fair die and the
loaded die follow the Markov property.

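Transitions: fair→fair 0.9, fair→loaded 0.1, loaded→loaded 0.9, loaded→fair 0.1
Fair emissions: 1: 1/6, 2: 1/6, 3: 1/6, 4: 1/6, 5: 1/6, 6: 1/6
Loaded emissions: 1: 1/10, 2: 1/10, 3: 1/10, 4: 1/10, 5: 1/10, 6: 1/2
Rolls:  6 4 5 3 1 3 2 6 5 1 4 5 6 3 6 6 6 4 6 3 1 6 5 6 6 6 2 4 5 3
States: F F F F F F F F F F F F L L L L L L L L L L L L L L F F F F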
51
Elements of a Hidden Markov Model

An HMM is characterized by the following elements:
N, the number of hidden states: S_1, S_2, ..., S_N
(N = 2: fair die, loaded die)
M, the number of distinct observations: V_1, V_2, ..., V_M
(M = 6: 1, 2, 3, 4, 5, 6)
Transition probabilities between the hidden states:
a_ij = P(S_j | S_i is the previous state) for each i, j from 1 to N
(P(F→F) = 0.9, P(F→L) = 0.1, P(L→L) = 0.9, P(L→F) = 0.1)
Emission probability of each observation for each hidden state:
b_i(k) = P(V_k | S_i)
(b_Fair(k) = 1/6 for k = 1, 2, 3, 4, 5, 6;
b_Loaded(k) = 1/10 for k = 1, 2, 3, 4, 5 and b_Loaded(6) = 1/2)
Initial probabilities: the probability of each state
being the first state.
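
With these elements fixed, the most probable sequence of hidden states for a series of rolls can be recovered with the Viterbi algorithm. The sketch below uses the transition and emission values from the dice example. The initial probabilities (0.5 for each die) are an assumption, since the slides do not give them.

```python
import math

STATES = ("F", "L")  # fair, loaded
TRANS = {"F": {"F": 0.9, "L": 0.1}, "L": {"F": 0.1, "L": 0.9}}
EMIT = {
    "F": {k: 1 / 6 for k in range(1, 7)},
    "L": {**{k: 1 / 10 for k in range(1, 6)}, 6: 1 / 2},
}
INIT = {"F": 0.5, "L": 0.5}  # assumed; not stated in the slides

def viterbi(rolls):
    """Most probable state path for the observed rolls (log space)."""
    v = [{s: math.log(INIT[s]) + math.log(EMIT[s][rolls[0]]) for s in STATES}]
    back = []
    for roll in rolls[1:]:
        row, ptr = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: v[-1][p] + math.log(TRANS[p][s]))
            row[s] = v[-1][prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][roll])
            ptr[s] = prev
        v.append(row)
        back.append(ptr)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: v[-1][s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return "".join(reversed(path))

rolls = [6, 4, 5, 3, 1, 3, 2, 6, 5, 1, 4, 5, 6, 3, 6,
         6, 6, 4, 6, 3, 1, 6, 5, 6, 6, 6, 2, 4, 5, 3]
print(viterbi(rolls))
```

The decoded path is the single best explanation under the model, so it need not agree roll for roll with the states that actually generated the data.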

52
Modeling Exon-Intron Segments

We can treat the exon and intron segments on DNA
sequences as two distinct hidden states, each of
which emits A, T, C, G with its own
probabilities.

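[Figure: two-state model. S1 (exon) and S2 (intron) each remain in the same state with probability 0.9 and switch to the other state with probability 0.1]
Exon emissions: A 1/6, T 1/6, C 1/3, G 1/3
Intron emissions: A 1/4, T 1/4, C 1/4, G 1/4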
53
Modeling Exon-Intron Segments

Number of hidden states: 2 (exon state and intron
state)
Number of observations: 4 (A, T, C, G)
Transition probabilities:
P(S1→S2) = 0.1
P(S1→S1) = 0.9
P(S2→S1) = 0.1
P(S2→S2) = 0.9
Emission probabilities:
S1: P(A) = P(T) = 1/6, P(C) = P(G) = 1/3
S2: P(A) = P(T) = P(C) = P(G) = 1/4
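
Given these parameters, the probability of a DNA sequence under the model, summed over every possible exon/intron labelling, can be computed with the forward algorithm. This is a minimal sketch that takes S1 as the exon state and S2 as the intron state; the initial probabilities (0.5 each) are an assumption, as the slides do not specify them.

```python
STATES = ("exon", "intron")
TRANS = {
    "exon": {"exon": 0.9, "intron": 0.1},
    "intron": {"exon": 0.1, "intron": 0.9},
}
EMIT = {
    "exon": {"A": 1 / 6, "T": 1 / 6, "C": 1 / 3, "G": 1 / 3},
    "intron": {"A": 1 / 4, "T": 1 / 4, "C": 1 / 4, "G": 1 / 4},
}
INIT = {"exon": 0.5, "intron": 0.5}  # assumed; not stated in the slides

def forward_likelihood(seq):
    """P(seq) under the model, summed over all hidden state paths."""
    alpha = {s: INIT[s] * EMIT[s][seq[0]] for s in STATES}
    for base in seq[1:]:
        alpha = {
            s: EMIT[s][base] * sum(alpha[p] * TRANS[p][s] for p in STATES)
            for s in STATES
        }
    return sum(alpha.values())

# A GC-rich sequence is more probable under this model than an AT-rich one.
print(forward_likelihood("GCGCGC"), forward_likelihood("ATATAT"))
```

For long sequences the products underflow, so practical implementations rescale alpha at each step or work with logarithms.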

54
Neural Networks
55

Homology-based Methods
The homology-based approach identifies genes
with the aid of experimental data. It exploits
sequence alignments between genomic data and
known cDNA (or protein) databases.
(1) Local Alignment Methods (BLAST-based): AAT, GAIA, INFO, TAP
(2) Pattern-based Alignment Methods: Flash, ICE

56
Combination Tools
These tools combine sequence similarity
with ab initio gene finding. They
predict genes by producing a spliced alignment
between a genomic sequence and a candidate amino
acid sequence.
Procrustes, GeneWise, GenomeScan
FGENESH+ and FGENESH++
GrailEXP_Gawain and _GALAHAD

57
Evaluating accuracy

Definitions

[Figure labels: TPP, TPE, FP, FN, TN]
TP (true positive): coding, correctly predicted as coding
TN (true negative): noncoding, correctly predicted as noncoding
FP (false positive): noncoding predicted as coding
FN (false negative): coding predicted as noncoding
58

Sensitivity
Sn = TPE / (TP + FN)
Specificity
Sp = TPE / (TP + FP)
Missing exons
ME = (number of missing exons) / (number of actual exons)
Wrong exons
WE = (number of wrong exons) / (number of predicted exons)
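
These four measures can be computed from two lists of exon coordinates. The sketch below works at the exon level and makes several assumptions: a true positive is a predicted exon whose start and end both match an actual exon, a missing exon is an actual exon that no prediction overlaps, and a wrong exon is a predicted exon that overlaps no actual exon. The coordinates in the example are hypothetical.

```python
def overlaps(a, b):
    """True if closed intervals a = (start, end) and b share any position."""
    return a[0] <= b[1] and b[0] <= a[1]

def exon_accuracy(actual, predicted):
    """Exon-level Sn, Sp, ME, and WE from (start, end) tuples."""
    actual, predicted = set(actual), set(predicted)
    correct = actual & predicted  # exact matches, counted as true positives
    missing = [e for e in actual if not any(overlaps(e, p) for p in predicted)]
    wrong = [p for p in predicted if not any(overlaps(p, e) for e in actual)]
    return {
        "Sn": len(correct) / len(actual),      # TP / (TP + FN)
        "Sp": len(correct) / len(predicted),   # TP / (TP + FP)
        "ME": len(missing) / len(actual),
        "WE": len(wrong) / len(predicted),
    }

# Hypothetical coordinates for illustration.
actual = [(100, 200), (300, 400), (500, 600), (700, 800)]
predicted = [(100, 200), (300, 410), (900, 950)]
print(exon_accuracy(actual, predicted))
```

Note that Sp as defined here is the fraction of predictions that are correct, which other fields call precision.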

59
Accuracy of exon prediction

Statistics-based (ab initio) Methods
Sensitivity/Specificity: 30-70%
Homology-based Methods
Sensitivity/Specificity: 99%
when related protein sequences are available
Combination Tools
????????

60
Regulatory Sequences
Phylogenetic footprinting: in principle,
two-sequence alignment followed by conserved
pattern discovery.
61
2.15 Phylogenetic shadowing
Uncovers conserved segments using multiple
sequence alignment
Conserved segment: candidate regulatory sequence
Larger diversity: non-regulatory sequence
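
The idea of scanning an alignment for conserved segments can be sketched as a per-column conservation score followed by a sliding window. The scoring rule (fraction of sequences sharing the most common character), the window size, and the threshold are all assumptions for illustration; the input is taken to be an alignment that has already been computed, with all rows the same length.

```python
from collections import Counter

def column_conservation(alignment):
    """Fraction of sequences sharing the most common character per column."""
    scores = []
    for column in zip(*alignment):
        top_count = Counter(column).most_common(1)[0][1]
        scores.append(top_count / len(column))
    return scores

def conserved_windows(alignment, window=5, threshold=1.0):
    """Start positions of windows whose mean conservation meets the threshold."""
    scores = column_conservation(alignment)
    return [
        i for i in range(len(scores) - window + 1)
        if sum(scores[i:i + window]) / window >= threshold
    ]

# Toy alignment of three sequences.
alignment = ["ACGTACGTAA", "ACGTACCTAT", "ACGTAGGTAC"]
print(conserved_windows(alignment))
```

Gap characters are treated like any other symbol here; a more careful score would also weight sequences by their evolutionary distance.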
62
Non-Protein Coding Genes

It is difficult to identify non-protein-coding
genes:
Most of the transcripts are not polyadenylated,
and so are absent from cDNA libraries.
They are conserved at the level of secondary
structure, with poor sequence constraint.
Relatively little is known about their function
and distribution.

63
Structural Features of Genome Sequences

Repetitive sequences (five classes)
GC content
Simple sequence repeats
Segmental duplication
Structure of centromeres and telomeres
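
Two of these features, GC content and simple sequence repeats, are easy to compute directly from sequence. The following is a minimal sketch; the window sizes, the minimum of four tandem copies, and the maximum repeat unit of 6 bp are assumed thresholds, not values from the slides.

```python
import re

def gc_content(seq):
    """Fraction of bases that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def windowed_gc(seq, window, step):
    """GC content in successive windows along the sequence."""
    return [gc_content(seq[i:i + window])
            for i in range(0, len(seq) - window + 1, step)]

def simple_repeats(seq, min_copies=4, max_unit=6):
    """(start, unit, copies) for tandem repeats of short units."""
    pattern = re.compile(r"([ACGT]{1,%d}?)\1{%d,}" % (max_unit, min_copies - 1))
    return [(m.start(), m.group(1), len(m.group(0)) // len(m.group(1)))
            for m in pattern.finditer(seq)]

print(windowed_gc("GGGGAAAA", 4, 4))
print(simple_repeats("TTCACACACACAGG"))
```

The lazy quantifier makes the pattern prefer the shortest repeating unit, so a run such as CACACACACA is reported with unit CA rather than CACA.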

64
The ENCODE Project

Only 5% of mammalian genome sequence is highly
conserved.
At most, only 2% of a typical mammalian genome
encodes transcripts.
The function of most non-genic sequence is unknown.

65
(No Transcript)
66
Clusters of Orthologous Genes (COGs)
paralog
ortholog
67
http://www.ncbi.nlm.nih.gov/COG/
68
Gene Ontology
69
(No Transcript)
70
http://www.geneontology.org/
