Genome Sequencing and Annotation - PowerPoint PPT Presentation

Slides: 71
Provided by: asia1
1
Genome Sequencing and Annotation
2
  • The first objective of most genome projects is to
    determine the DNA sequence either of the genome
    or of a large number of transcripts.

3
Automated DNA Sequencing
  • The Principle of Sanger Sequencing

4
ddNTP
5
(No Transcript)
6
Isotope label: 33P or 35S
7
  • Bands corresponding to molecules that differ in
    length by one nucleotide are separated by 0.5 mm.
  • In general, only up to 500 bases could be read
    from a single set of four lanes on a 30-cm gel.
  • It is difficult to sequence GC-rich regions.

8
High-throughput Sequencing
9
  • Four-color fluorescent dyes have replaced the
    radioactive label.
  • Rather than stopping the electrophoresis at a
    particular time, the products are scanned for
    laser-induced fluorescence just before they run
    off the end of the electrophoresis medium.
  • Reads greater than 1200 bp in a run are possible
    with current technology, though the 500-900 bp
    range is more common.
  • Capillary electrophoresis

10
Anion carrier
11
ABI 3700 Sequencer
Twenty 96-well plates are handled per day, so that
2 Mb of sequence can be determined.
12
Reading Sequence Traces
  • The reading of raw sequence traces, or
    base-calling, is now routinely performed using
    automated software that reads bases, aligns
    similar sequences, and provides an intuitive
    platform for editing.
  • phred: free software
  • http://www.phrap.org/

13
(No Transcript)
14
2.2 The phred base-calling algorithm
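phred reports a quality score for each called base, defined from the estimated error probability as Q = -10 log10(P). The following is a minimal sketch of that conversion only, not of the base-calling algorithm itself; the function names are hypothetical.

```python
import math


def phred_quality(error_probability: float) -> float:
    """Convert an estimated base-call error probability to a phred quality score."""
    if not 0.0 < error_probability <= 1.0:
        raise ValueError("error probability must be in (0, 1]")
    return -10.0 * math.log10(error_probability)


def error_probability(quality: float) -> float:
    """Invert a phred quality score back to an error probability."""
    return 10.0 ** (-quality / 10.0)
```

Under this definition a score of 20 corresponds to one expected error in 100 calls, and a score of 30 to one in 1000.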
15
2.3 Automated sequence chromatograms
< 50 bp
5' → 3'
> 800 bp
16
Contig Assembly
Optimizing sequence alignment with highest score
17
Box 2.1 Pairwise Sequence Alignment
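Finding the alignment with the highest score is usually done by dynamic programming. The sketch below computes a global (Needleman-Wunsch style) alignment score; the scoring values (match +1, mismatch -1, gap -1) and the function name are assumptions for illustration and are not taken from the slides.

```python
def global_alignment_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -1) -> int:
    """Return the best global alignment score of a and b with a linear gap penalty."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against the empty string costs one gap per character.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diagonal,                 # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,    # gap in b
                              score[i][j - 1] + gap)    # gap in a
    return score[-1][-1]
```

For example, `global_alignment_score("ACGT", "AGT")` aligns three matching bases and one gap.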
18
2.5 An aligned-reads window in consed
Phrap assembler and consed graphic editor
http://www.phrap.org/
19
Several sequencing technologies have been proposed
to reduce the cost.
20
Emerging Sequencing Methods
  • Sequencing by hybridization (SBH)
  • Mass spectrophotometric techniques
  • Nanopore sequencing strategies
  • Single-molecule Sanger sequencing
  • Polony sequencing

21
(No Transcript)
22
Mass spectrophotometric techniques
23
Mass spectrophotometric techniques for DNA
sequencing use the same principle
24
Nanopore sequencing strategies
25
http://arep.med.harvard.edu/Polonator/chem/
A, C, G, U
26
2.6 Single-molecule polony sequencing
http://arep.med.harvard.edu/Polonator/
27
Genome Sequencing
  • Whole chromosome sequences are reassembled from
    the sequences of hundreds of thousands of
    fragments, each typically between 500 and 1000 bp
    in length.
  • Two general strategies
  • 1. Hierarchical sequencing
  • 2. Shotgun sequencing

28
2.7 (Part 1) Hierarchical versus shotgun
sequencing
29
2.7 (Part 2) Hierarchical versus shotgun
sequencing
30
Hierarchical sequencing
  • Also known as top-down, map-based, or
    clone-by-clone strategy, developed in the late
    1980s.
  • The first step is to clone the genome as
    manageable units of some 50-200 kb in length,
    whose relative locations are known.
  • DNA libraries are constructed by partial
    digestion or shearing of genomic DNA, and the
    fragments are ligated into the cloning site of a
    vector using standard recombinant DNA procedures.

31
The copy number is close to one per cell
32
2.8 Cloning vectors used in genome sequencing
33
2.9 Hierarchical assembly of a sequence-contig
scaffold (supercontig) --- tiling path
5- to 10-fold redundancy
  • Assembly methods
  • 1. Hybridization (A)
  • 2. Fingerprinting (B, C, D)
  • 3. End-sequencing

34
2.10 Aligning BAC clones by hybridization and
fingerprinting
35
End-sequencing
36
Shotgun sequencing
  • Sequence fragments with 5- to 10-fold redundancy
    are generated from a plasmid library that is
    constructed from a single whole genome.
  • The contig scaffolds were produced by Celera
    Genomics using computer algorithms such as
    Screener, Overlapper, and Unitigger.
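The 5- to 10-fold redundancy can be related to read counts with simple arithmetic: coverage is (number of reads × read length) / genome size. Under the idealized assumption of uniformly random read placement (the Lander-Waterman model, which the slides do not name), the expected fraction of the genome left uncovered is roughly e^(-coverage). A minimal sketch with hypothetical function names:

```python
import math


def fold_coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average number of times each base is sequenced."""
    return num_reads * read_length / genome_size


def reads_needed(genome_size: int, read_length: int, coverage: float) -> int:
    """Number of reads required to reach the requested fold coverage."""
    return math.ceil(coverage * genome_size / read_length)


def expected_uncovered_fraction(coverage: float) -> float:
    """Idealized fraction of bases with no read, assuming random read placement."""
    return math.exp(-coverage)
```

For a 100 Mb genome and 500 bp reads, `reads_needed(100_000_000, 500, 10)` gives the read count for 10-fold redundancy; real projects deviate from the idealized model because of cloning bias and repeats.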

37
2.11 U-unitigs and repeat resolution
Overlaps of at least 40 bp with no more than 6%
differences were accepted.
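The acceptance rule can be sketched as a suffix-prefix comparison between two reads. This is a simplified illustration and not the Celera Overlapper: it counts substitutions only (no insertions or deletions), reads the threshold as a 6% difference rate (an assumption), and uses a hypothetical function name.

```python
def best_overlap(a: str, b: str, min_len: int = 40,
                 max_diff_rate: float = 0.06) -> int:
    """Length of the longest suffix of a that overlaps a prefix of b
    within the allowed mismatch rate, or 0 if no overlap qualifies."""
    if min_len < 1:
        raise ValueError("min_len must be at least 1")
    # Try the longest possible overlap first and shrink until one passes.
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        suffix, prefix = a[-length:], b[:length]
        mismatches = sum(x != y for x, y in zip(suffix, prefix))
        if mismatches <= max_diff_rate * length:
            return length
    return 0
```

A production overlapper would use seed-and-extend indexing rather than this quadratic scan, and would allow gaps.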
38
2.12 Proportion of fly and human genomes in
large scaffolds
39
Sequence Verification
  • Completeness
  • Accuracy
  • Validity of assembly

40
2.13 Alignment of two draft human genome
assemblies
41
Genome Annotation
42
EST Sequencing
43
EST
44
2.14 (Part 1) Relationship between gene
structure, cDNA, and EST sequences
Alternative splicing
45
  • Alternative splicing of mRNA
  • Splice pair

GU donor site
AG acceptor site
Exon Enhancer Motif
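As a rough illustration of the splice pair, the sketch below lists every span of a pre-mRNA string that starts with a GU donor and ends with an AG acceptor. It is deliberately naive: real splice-site recognition depends on much more context (including enhancer motifs), and the function name and the `min_len` parameter are hypothetical.

```python
def candidate_introns(rna: str, min_len: int = 4) -> list[tuple[int, int]]:
    """Return (start, end) spans where rna[start:end] begins with GU and ends with AG."""
    donors = [i for i in range(len(rna) - 1) if rna[i:i + 2] == "GU"]
    acceptors = [j for j in range(len(rna) - 1) if rna[j:j + 2] == "AG"]
    return [(d, a + 2) for d in donors for a in acceptors if a + 2 - d >= min_len]
```

For the string `"AAGUCCAGAA"` the only candidate is the span covering `GUCCAG`.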
46
  • Four Modes of Alternative Splicing

Splice / Don't Splice   
Competing 5' or 3' Splice Sites 
47
Exon Skipping  
Mutually Exclusive Exons 
48
Ab Initio Gene Discovery
Predict genes according to sequence patterns
Gene structure diagram (labels, 5' to 3'): flanking region, GC box, CAAT box, GC box, TATA box, TSS, initiation codon, introns bounded by GT ... AG, stop codon, AATAAA, poly(A)-addition site, flanking region
49
  • Statistics-based (ab initio) Methods
  • The ab initio approach predicts genes directly
    from the computational properties of exons,
    introns, and other features in the genomic
    sequence, without reference to experimental
    data.
  • (1) Hidden Markov Model: GenScan, Genie,
    Genemark, Veil, HMMgene
  • (2) Neural Networks: Grail II,
    GrailEXP_Perceval
  • (3) Decision Tree: MZEF, MZEF-SPC
  • (4) Integration of Various Statistical
    Approaches: FGENESH

50
Example: Two dice
  • One die is fair, producing 1, 2, 3, 4, 5, 6 with
    equal probability. The other is loaded and
    produces 6 with higher probability than the
    other numbers.
  • Each time, only one die is rolled, producing an
    observable number from the set {1, 2, 3, 4, 5, 6}.
    We do not know which of the two dice was rolled.
  • The transitions between the fair die and the
    loaded die follow the Markov property.

States: S1 = fair, S2 = loaded
Transitions: fair → fair 0.9, fair → loaded 0.1, loaded → loaded 0.9, loaded → fair 0.1
Fair emissions: 1: 1/6, 2: 1/6, 3: 1/6, 4: 1/6, 5: 1/6, 6: 1/6
Loaded emissions: 1: 1/10, 2: 1/10, 3: 1/10, 4: 1/10, 5: 1/10, 6: 1/2
Observed rolls: 6 4 5 3 1 3 2 6 5 1 4 5 6 3 6 6 6 4 6 3 1 6 5 6 6 6 2 4 5 3
Hidden states:  F F F F F F F F F F F F L L L L L L L L L L L L L L F F F F
51
Elements of a Hidden Markov Model
  • An HMM is characterized by the following elements:
  • $N$, the number of hidden states $S_1, S_2, \ldots, S_N$
    ($N = 2$: fair die, loaded die)
  • $M$, the number of distinct observations $V_1, V_2, \ldots, V_M$
    ($M = 6$: 1, 2, 3, 4, 5, 6)
  • Transition probabilities between the hidden
    states:
    $a_{ij} = P(S_j \mid S_i \text{ is the previous state})$ for each
    $i, j$ from 1 to $N$
    ($P(F \to F) = 0.9$, $P(F \to L) = 0.1$, $P(L \to L) = 0.9$,
    $P(L \to F) = 0.1$)
  • Emission probability of each observation for each
    hidden state: $b_i(k) = P(V_k \mid S_i)$
    ($b_{\mathrm{Fair}}(k) = 1/6$ for $k = 1, \ldots, 6$;
    $b_{\mathrm{Loaded}}(k) = 1/10$ for $k = 1, \ldots, 5$ and
    $b_{\mathrm{Loaded}}(6) = 1/2$)
  • Initial probabilities: the probability of each
    state being the first state.
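Given these elements, the most probable sequence of hidden states for a series of rolls can be recovered with the Viterbi algorithm. The sketch below uses the transition and emission values from the dice example; the initial probabilities of 0.5 for each state are an assumption, since the slides do not give them, and the names are hypothetical.

```python
import math

STATES = ("F", "L")                      # F = fair die, L = loaded die
START = {"F": 0.5, "L": 0.5}             # assumption: not specified on the slide
TRANS = {"F": {"F": 0.9, "L": 0.1},
         "L": {"F": 0.1, "L": 0.9}}
EMIT = {"F": {k: 1 / 6 for k in range(1, 7)},
        "L": {1: 1 / 10, 2: 1 / 10, 3: 1 / 10, 4: 1 / 10, 5: 1 / 10, 6: 1 / 2}}


def viterbi(rolls: list[int]) -> str:
    """Return the most probable hidden-state path (a string of F/L) for the rolls."""
    if not rolls:
        return ""
    # Work in log space so long sequences do not underflow.
    score = {s: math.log(START[s]) + math.log(EMIT[s][rolls[0]]) for s in STATES}
    backpointers = []
    for roll in rolls[1:]:
        previous, score, pointers = score, {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: previous[p] + math.log(TRANS[p][s]))
            pointers[s] = best
            score[s] = (previous[best] + math.log(TRANS[best][s])
                        + math.log(EMIT[s][roll]))
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))
```

Calling `viterbi` on the 30 rolls from the example returns a string of 30 state labels; it need not match the labeling on the slide exactly, because the decoded path depends on the assumed initial probabilities and on how close the competing paths are.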

52
Modeling Exon-Intron Segments
  • We can treat the exon and intron segments on DNA
    sequences as two distinct hidden states, each of
    which emits A, T, C, G with its own
    probabilities.

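States: S1 = exon, S2 = intron
Transitions: exon → exon 0.9, exon → intron 0.1, intron → intron 0.9, intron → exon 0.1
Exon emissions: A 1/6, T 1/6, C 1/3, G 1/3
Intron emissions: A 1/4, T 1/4, C 1/4, G 1/4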
53
Modeling Exon-Intron Segments
  • Number of hidden states: 2 (exon state and intron
    state)
  • Number of observations: 4 (A, T, C, G)
  • Transition probabilities
  • $P(S_1 \to S_2) = 0.1$
  • $P(S_1 \to S_1) = 0.9$
  • $P(S_2 \to S_1) = 0.1$
  • $P(S_2 \to S_2) = 0.9$
  • Emission probabilities
  • $S_1$: $P(A) = P(T) = 1/6$,
    $P(C) = P(G) = 1/3$
  • $S_2$: $P(A) = P(T) = P(C) = P(G) = 1/4$
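With these parameters, the forward algorithm gives the total probability of a DNA sequence under the model, and normalizing its final values gives the probability that the last base lies in an exon. This is a toy sketch: the initial probabilities of 0.5 are an assumption not stated on the slides, and all names are hypothetical.

```python
HIDDEN = ("exon", "intron")
INITIAL = {"exon": 0.5, "intron": 0.5}   # assumption: not given on the slide
TRANSITION = {"exon": {"exon": 0.9, "intron": 0.1},
              "intron": {"exon": 0.1, "intron": 0.9}}
EMISSION = {"exon": {"A": 1 / 6, "T": 1 / 6, "C": 1 / 3, "G": 1 / 3},
            "intron": {"A": 1 / 4, "T": 1 / 4, "C": 1 / 4, "G": 1 / 4}}


def forward(seq: str) -> dict[str, float]:
    """Return alpha[s] = P(seq, last hidden state = s) for a non-empty sequence."""
    alpha = {s: INITIAL[s] * EMISSION[s][seq[0]] for s in HIDDEN}
    for base in seq[1:]:
        alpha = {s: EMISSION[s][base] * sum(alpha[p] * TRANSITION[p][s] for p in HIDDEN)
                 for s in HIDDEN}
    return alpha


def sequence_probability(seq: str) -> float:
    """Total probability of the sequence, summed over all hidden-state paths."""
    return sum(forward(seq).values())


def exon_probability_at_end(seq: str) -> float:
    """P(last base is in the exon state | seq)."""
    alpha = forward(seq)
    return alpha["exon"] / sum(alpha.values())
```

Because the exon state favors C and G, a GC-rich string such as `"GCGCGC"` yields an exon probability above one half. For long sequences the products should be scaled or computed in log space to avoid underflow.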

54
Neural Networks
55
  • Homology-based Methods
  • The homology-based approach identifies genes
    with the aid of experimental data. It exploits
    alignments between the genomic sequence and
    known cDNA (or protein) sequences in databases.
  • (1) Local Alignment Methods (BLAST-based): AAT,
    GAIA, INFO, TAP
  • (2) Pattern-based Alignment Methods: Flash, ICE

56
  • Combination Tools
  • These tools combine sequence similarity with
    ab initio gene finding. They predict genes by
    producing a spliced alignment between a genomic
    sequence and a candidate amino acid sequence.
  • Procrustes, GeneWise, GenomeScan
  • FGENESH+ and FGENESH++
  • GrailEXP_Gawain and _GALAHAD

57
Evaluate accuracy
  • Definition

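Figure labels: TPP, TPE, FP, FN, TN
TP (True Positive): coding sequence correctly predicted as coding
TN (True Negative): noncoding sequence correctly predicted as noncoding
FP (False Positive): noncoding sequence predicted as coding
FN (False Negative): coding sequence predicted as noncoding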
58
  • Sensitivity
  • $Sn = TPE / (TP + FN)$
  • Specificity
  • $Sp = TPE / (TP + FP)$
  • Missing exon
  • $ME$ = (number of missing exons) / (number
    of actual exons)
  • Wrong exon
  • $WE$ = (number of wrong exons) / (number of
    predicted exons)
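The measures can be computed from a per-base comparison of an actual annotation and a prediction. In this sketch each sequence position is labeled `C` (coding) or `N` (noncoding). The slide writes the numerator as TPE without defining it, so the code assumes it is the true-positive count TP. Function names are hypothetical.

```python
def confusion_counts(actual: str, predicted: str) -> dict[str, int]:
    """Count TP, TN, FP, FN over two equal-length strings of 'C'/'N' labels."""
    if len(actual) != len(predicted):
        raise ValueError("annotations must have the same length")
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for a, p in zip(actual, predicted):
        if a == "C" and p == "C":
            counts["TP"] += 1      # coding predicted as coding
        elif a == "N" and p == "N":
            counts["TN"] += 1      # noncoding predicted as noncoding
        elif a == "N" and p == "C":
            counts["FP"] += 1      # noncoding predicted as coding
        else:
            counts["FN"] += 1      # coding predicted as noncoding
    return counts


def sensitivity(counts: dict[str, int]) -> float:
    """Sn = TP / (TP + FN): fraction of actual coding bases that were found."""
    return counts["TP"] / (counts["TP"] + counts["FN"])


def specificity(counts: dict[str, int]) -> float:
    """Sp = TP / (TP + FP), following the definition used on the slide."""
    return counts["TP"] / (counts["TP"] + counts["FP"])
```

Note that Sp as defined here is the fraction of predicted coding bases that are truly coding, which is the convention in gene-prediction evaluation rather than the TN-based definition used elsewhere.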

59
Accuracy of exon prediction
  • Statistics-based (ab initio) Methods
  • Sensitivity/Specificity: 30-70%
  • Homology-based Methods
  • Sensitivity/Specificity: 99%
  • when related protein sequences are available
  • Combination Tools
  • ????????

60
Regulatory Sequences
Phylogenetic footprinting: in principle, pairwise
sequence alignment and conserved pattern discovery.
61
2.15 Phylogenetic shadowing
Uncover conserved segments with multiple
sequence alignment
Conserved segments: candidate regulatory sequences
Greater diversity: non-regulatory sequence
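The idea of uncovering conserved segments can be sketched as a per-column score over an existing multiple sequence alignment, followed by a scan for runs of highly conserved columns. This is a simplified illustration, not the published shadowing method: the score is just the frequency of the most common character in each column (gap characters are counted like any other symbol), and the function names, `threshold`, and `min_len` are hypothetical.

```python
from collections import Counter


def column_conservation(alignment: list[str]) -> list[float]:
    """Fraction of sequences sharing the most common character in each column."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("aligned sequences must have equal length")
    scores = []
    for column in zip(*alignment):
        most_common_count = Counter(column).most_common(1)[0][1]
        scores.append(most_common_count / len(column))
    return scores


def conserved_segments(scores: list[float], threshold: float = 1.0,
                       min_len: int = 3) -> list[tuple[int, int]]:
    """Return (start, end) column spans where every score is >= threshold."""
    segments, start = [], None
    # A sentinel below any threshold closes a run that reaches the last column.
    for i, score in enumerate(scores + [float("-inf")]):
        if score >= threshold:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_len:
                segments.append((start, i))
            start = None
    return segments
```

Columns that stay identical across all species form the candidate regulatory segments; columns with greater diversity break the runs.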
62
Non-Protein Coding Genes
  • It is difficult to identify non-protein coding
    genes
  • Most of the transcripts are not polyadenylated,
    and so are missing from cDNA libraries.
  • They are conserved at the level of secondary
    structure, with little constraint on sequence.
  • Relatively little is known about their function
    and distribution.

63
Structural Features of Genome Sequences
  • Repetitive sequences (five classes)
  • GC content
  • Simple sequence repeats
  • Segmental duplication
  • Structure of centromeres and telomeres

64
The ENCODE Project
  • Only 5% of mammalian genome sequence is highly
    conserved.
  • At most, only 2% of a typical mammalian genome
    encodes transcripts.
  • Most of the non-genic sequences are of unknown
    function.

65
(No Transcript)
66
Clusters of Orthologous Genes (COGs)
paralog
ortholog
67
http://www.ncbi.nlm.nih.gov/COG/
68
Gene Ontology
69
(No Transcript)
70
http://www.geneontology.org/