Title: Genomics 101
1 Genomics 101
- DNA sequencing
- Alignment
- Gene identification
- Gene expression
- Genome evolution
2 Next Few Topics
- Gene Recognition
  - Finding genes in DNA with computational methods
- Large-scale alignment, multiple alignment
  - Comparing whole genomes, or large families of genes
- Gene Expression and Regulation
  - Measuring the expression of many genes at a time
  - Finding elements in DNA that control the expression of genes
3 Gene Recognition
Credits for slides: Marina Alexandersson, Lior Pachter, Serge Saxonov
4 Reading
- GENSCAN
- EasyGene
- SLAM
- Twinscan
- Optional
  - Chris Burge's thesis
5 Gene expression
DNA: CCTGAGCCAACTATTGATGAA
RNA: CCUGAGCCAACUAUUGAUGAA
Protein: PEPTIDE
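The DNA, RNA, and protein lines on this slide can be reproduced with a few lines of Python. This is a minimal sketch that assumes the standard genetic code; the names `CODON_TABLE`, `transcribe`, and `translate` are illustrative and do not come from the slides.

```python
# Standard genetic code, laid out in the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def transcribe(dna):
    """Coding-strand DNA to mRNA: every T becomes U."""
    return dna.upper().replace("T", "U")


def translate(dna):
    """Translate a coding sequence codon by codon, stopping at a STOP codon."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":  # "*" marks TAA, TAG, TGA
            break
        protein.append(amino_acid)
    return "".join(protein)


dna = "CCTGAGCCAACTATTGATGAA"
print(transcribe(dna))  # CCUGAGCCAACUAUUGAUGAA
print(translate(dna))   # PEPTIDE
```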
6 Gene structure
[Figure: a gene made of exon1, intron1, exon2, intron2, exon3, processed by transcription, then splicing, then translation]
Codon: a triplet of nucleotides that is converted to one amino acid
Exon: protein-coding; intron: non-coding
7 Where are the genes?
8 In humans: 22,000 genes, 1.5% of human DNA
9 Finding Genes
- Exploit the regular gene structure
  - ATG - Exon1 - Intron1 - Exon2 - ... - ExonN - STOP
- Recognize "coding bias"
  - CAG-CGA-GAC-TAT-TTA-GAT-AAC-ACA-CAT-GAA-...
- Recognize splice sites
  - Intron - cAGt - Exon - gGTgag - Intron
- Model the duration of regions
  - Introns tend to be much longer than exons, in mammals
  - Exons are biased to have a given minimum length
- Use cross-species comparison
  - Gene structure is conserved in mammals
  - Exons are more similar (85%) than introns
10 Approaches to gene finding
- Homology
  - BLAST, Procrustes
- Ab initio
  - Genscan, Genie, GeneID
- Hybrids
  - GenomeScan, GenieEST, Twinscan, SGP, ROSETTA, CEM, TBLASTX, SLAM
11 1. Exploit the regular gene structure
Splice sites
12 [Figure labels: Next Exon, Frame 0; Next Exon, Frame 1]
13 2. Recognize "coding bias"
- Each exon can be in one of three frames
  - aggattacagattacagattacagtaag (Frame 0)
  - aggattacagattacagattacagtaag (Frame 1)
  - aggattacagattacagattacagtaag (Frame 2)
- Frame of next exon depends on how many nucleotides are left over from previous exon
- Codons "tag", "tga", and "taa" are STOP
  - No STOP codon appears in-frame, until end of gene
  - A stretch with no in-frame STOP is called an open reading frame (ORF)
- Different codons appear with different frequencies: coding bias
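The three reading frames and the in-frame STOP rule can be checked mechanically. The sketch below reads the slide's example sequence in each frame and lists where a STOP codon falls; the function names are illustrative, and frame f is taken here to mean "start reading at offset f".

```python
STOP_CODONS = {"tag", "tga", "taa"}


def in_frame_stops(seq):
    """Return {frame: [offsets of STOP codons when reading in that frame]}."""
    seq = seq.lower()
    stops = {}
    for frame in range(3):
        stops[frame] = [
            i for i in range(frame, len(seq) - 2, 3)
            if seq[i:i + 3] in STOP_CODONS
        ]
    return stops


def open_frames(seq):
    """Frames in which the whole sequence is free of in-frame STOP codons."""
    return [frame for frame, hits in in_frame_stops(seq).items() if not hits]


seq = "aggattacagattacagattacagtaag"
print(in_frame_stops(seq))  # {0: [24], 1: [], 2: []}
print(open_frames(seq))     # [1, 2]
```

In frame 0 the only STOP (taa at offset 24) starts right after the final "cag", that is, in the "gtaag" stretch that looks like the beginning of an intron, so the exon portion itself stays open in all three frames.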
14 2. Recognize "coding bias"
Amino acid      SLC   DNA codons
Isoleucine      I     ATT, ATC, ATA
Leucine         L     CTT, CTC, CTA, CTG, TTA, TTG
Valine          V     GTT, GTC, GTA, GTG
Phenylalanine   F     TTT, TTC
Methionine      M     ATG
Cysteine        C     TGT, TGC
Alanine         A     GCT, GCC, GCA, GCG
Glycine         G     GGT, GGC, GGA, GGG
Proline         P     CCT, CCC, CCA, CCG
Threonine       T     ACT, ACC, ACA, ACG
Serine          S     TCT, TCC, TCA, TCG, AGT, AGC
Tyrosine        Y     TAT, TAC
Tryptophan      W     TGG
Glutamine       Q     CAA, CAG
Asparagine      N     AAT, AAC
Histidine       H     CAT, CAC
Glutamic acid   E     GAA, GAG
Aspartic acid   D     GAT, GAC
Lysine          K     AAA, AAG
Arginine        R     CGT, CGC, CGA, CGG, AGA, AGG
Stop codons     Stop  TAA, TAG, TGA
Can map the 61 non-stop codons to frequencies and take log-odds ratios
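15 [Figure: gene model running from the start codon atg to the stop codon tga; exon/intron boundaries are labeled with donor-side sequences (ggtgag) and acceptor-side sequences (caggtg, cagatg, cagttg, caggcc)]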
16 Biology of Splicing
(http://genes.mit.edu/chris/)
17 3. Recognize splice sites
Donor: 7.9 bits; Acceptor: 9.4 bits (Stephens & Schneider, 1996)
(http://www-lmmb.ncifcrf.gov/toms/sequencelogo.html)
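The "bits" figures quoted for donor and acceptor sites are information content summed over the positions of aligned sites. A minimal sketch of that calculation follows, assuming a uniform background (2 bits maximum per DNA position) and omitting the small-sample correction used for published sequence logos; the example sites are made up.

```python
import math
from collections import Counter


def information_per_position(sites):
    """Information content (bits) of each column of equal-length aligned sites."""
    bits = []
    for pos in range(len(sites[0])):
        counts = Counter(site[pos] for site in sites)
        n = sum(counts.values())
        entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
        bits.append(2.0 - entropy)  # 2 bits = log2(4) for a uniform background
    return bits


# Hypothetical aligned sites: two invariant columns and one uniform column.
sites = ["GTA", "GTC", "GTG", "GTT"]
per_position = information_per_position(sites)
print(per_position)       # [2.0, 2.0, 0.0]
print(sum(per_position))  # 4.0
```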
18 3. Recognize splice sites
19 3. Recognize splice sites
- WMM: weight matrix model = PSSM (Staden 1984)
- WAM: weight array model = 1st order Markov (Zhang & Marr 1993)
- MDD: maximal dependence decomposition (Burge & Karlin 1997)
  - Decision-tree algorithm to take pairwise dependencies into account
  - For each position $i$, calculate $S_i = \sum_{j \neq i} \chi^2(C_i, X_j)$
  - Choose $i$ such that $S_i$ is maximal and partition into two subsets, until
    - No significant dependencies left, or
    - Not enough sequences in subset
  - Train separate WMM models for each subset
[Figure: MDD decision tree for donor splice sites]
All donor splice sites
  not G5
  G5
    G5, not G-1
    G5, G-1
      G5, G-1, not A2
      G5, G-1, A2
        G5, G-1, A2, not U6
        G5, G-1, A2, U6
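The MDD steps above can be sketched as a single split followed by one WMM per subset. The code below is an illustration under simplifying assumptions, not the GENSCAN implementation: it performs one split only (no recursion, no significance test, no minimum subset size), takes $C_i$ to be "position i matches the consensus", and uses a uniform 0.25 background with a pseudocount for the WMM. The three-base sites are made up.

```python
import math

BASES = "ACGT"


def chi_square(table):
    """Pearson chi-square statistic of a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for r, row in enumerate(table):
        for c, observed in enumerate(row):
            expected = row_totals[r] * col_totals[c] / n
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    return stat


def mdd_scores(sites, consensus):
    """S_i = sum over j != i of chi2(C_i, X_j) for every position i."""
    length = len(consensus)
    scores = []
    for i in range(length):
        s_i = 0.0
        for j in range(length):
            if j == i:
                continue
            # Rows: site mismatches / matches consensus at i. Columns: base at j.
            table = [[0] * 4 for _ in range(2)]
            for site in sites:
                table[int(site[i] == consensus[i])][BASES.index(site[j])] += 1
            s_i += chi_square(table)
        scores.append(s_i)
    return scores


def mdd_split(sites, consensus):
    """Partition sites on the position with maximal S_i."""
    scores = mdd_scores(sites, consensus)
    i = max(range(len(scores)), key=lambda k: scores[k])
    matching = [s for s in sites if s[i] == consensus[i]]
    others = [s for s in sites if s[i] != consensus[i]]
    return i, matching, others


def train_wmm(sites, pseudocount=1.0):
    """Per-position log2 odds against a uniform 0.25 background."""
    total = len(sites) + 4 * pseudocount
    return [
        {b: math.log2((sum(s[pos] == b for s in sites) + pseudocount) / total / 0.25)
         for b in BASES}
        for pos in range(len(sites[0]))
    ]


def score_wmm(matrix, seq):
    """WMM score: sum of the per-position log-odds of the observed bases."""
    return sum(column[base] for column, base in zip(matrix, seq))


# Toy sites: positions 0 and 2 are perfectly dependent, position 1 is independent.
sites = ["GAT", "GCT", "GAT", "GCT", "AAC", "ACC", "AAC", "ACC"]
position, matching, others = mdd_split(sites, consensus="GAT")
models = [train_wmm(matching), train_wmm(others)]
print(mdd_scores(sites, "GAT"))  # [8.0, 0.0, 8.0]
print(position)                  # 0
```

A candidate site would then be scored with the model of whichever subset it falls into.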
20 4. Model the duration of regions
21 Hidden Markov Models for Gene Finding
States: First Exon State, Intron State, Intergene State
GTCAGATGAGCAAAGTAGACACTCCAGTAACGCGGTGAGTACATTAA
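Labeling each base with a state such as intergene, exon, or intron is a Viterbi decoding problem. The sketch below is a generic log-space Viterbi with a toy three-state model; every probability in it is invented for illustration and has nothing to do with the parameters of a real gene finder.

```python
import math


def log(p):
    """Natural log that maps probability 0 to -infinity instead of raising."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(seq, states, start, trans, emit):
    """Most probable state path for seq under a plain HMM (log space)."""
    scores = {s: log(start[s]) + log(emit[s][seq[0]]) for s in states}
    backpointers = []
    for symbol in seq[1:]:
        new_scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda r: scores[r] + log(trans[r][s]))
            new_scores[s] = scores[prev] + log(trans[prev][s]) + log(emit[s][symbol])
            pointers[s] = prev
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(states, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Toy model (hypothetical numbers): exons slightly GC-rich, introns AT-rich.
STATES = ["intergene", "exon", "intron"]
START = {"intergene": 0.8, "exon": 0.1, "intron": 0.1}
TRANS = {
    "intergene": {"intergene": 0.9, "exon": 0.1, "intron": 0.0},
    "exon": {"intergene": 0.05, "exon": 0.9, "intron": 0.05},
    "intron": {"intergene": 0.0, "exon": 0.1, "intron": 0.9},
}
EMIT = {
    "intergene": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "exon": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "intron": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}

seq = "GTCAGATGAGCAAAGTAGACACTCCAGTAACGCGGTGAGTACATTAA"
path = viterbi(seq, STATES, START, TRANS, EMIT)
print(len(path) == len(seq))  # True: one state label per base
```

A real gene finder uses many more states (exon frames, splice-site signals, both strands), but the decoding step has this shape.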
23 Duration HMM for Gene Finding
Duration modeling
- Introns: regular HMM states, geometric duration
- Exons: special duration model

$$V_{E_{0,0}}(i) = \max_{d = 1 \ldots D} \left\{ \mathrm{Prob}[\mathrm{duration}(E_{0,0}) = d] \times a_{\mathrm{Intron}_0,\, E_{0,0}} \times \prod_{j = i-d+1}^{i} e_{E_{0,0}}(x_j) \right\}$$

where $i$ is an admissible exon-ending state and $D$ is restricted by the longest ORF

GENSCAN (Chris Burge and Sam Karlin, 1997)
- Best performing de novo gene finder
- HMM with duration modeling for Exon states
[Figure: duration plot]
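The maximization over exon lengths d can be sketched in log space as follows. The function evaluates only the terms written on the slide (duration probability, one transition probability, and the product of emissions over the last d bases), uses prefix sums so each d costs constant time, and treats positions as 1-based. The length distribution, emission values, and names are illustrative assumptions.

```python
import math


def best_exon_ending_at(i, x, log_emit, log_duration, log_a, max_d):
    """Return (score, d) maximizing, over d = 1..max_d,
    log P(duration = d) + log a + sum of log e(x_j) for j = i-d+1..i.
    """
    # prefix[k] = sum of log emissions of x[0:k], so a window is a difference.
    prefix = [0.0]
    for base in x:
        prefix.append(prefix[-1] + log_emit[base])
    best_score, best_d = float("-inf"), None
    for d in range(1, min(max_d, i) + 1):
        score = log_duration(d) + log_a + prefix[i] - prefix[i - d]
        if score > best_score:
            best_score, best_d = score, d
    return best_score, best_d


# Hypothetical explicit length distribution: exons of length 3, 6, or 9 only.
LENGTH_PROBS = {3: 0.2, 6: 0.5, 9: 0.3}


def log_duration(d):
    p = LENGTH_PROBS.get(d, 0.0)
    return math.log(p) if p > 0 else float("-inf")


log_emit = {base: math.log(0.25) for base in "ACGT"}  # uniform toy emissions
x = "ATGGCCAAGTAA"
score, d = best_exon_ending_at(len(x), x, log_emit, log_duration, log_a=0.0, max_d=len(x))
print(d)  # 3
```

Bounding `max_d` by the longest open reading frame, as the slide notes, keeps the inner loop short in practice.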
24 HMM-based Gene Finders
- GENSCAN (Burge 1997)
  - Big jump in accuracy of de novo gene finding
  - Currently, one of the best
  - HMM with duration modeling for Exon states
- FGENESH (Solovyev 1997)
  - Currently one of the best
- HMMgene (Krogh 1997)
- GENIE (Kulp 1996)
- GENMARK (Borodovsky & McIninch 1993)
- VEIL (Henderson, Salzberg, & Fasman 1997)
25 Better way to do it: negative binomial
- EasyGene
  - Prokaryotic gene-finder
  - Larsen TS, Krogh A
- Negative binomial with n = 3
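A negative binomial duration arises when n identical states, each with the same self-loop probability p, are chained in series: the total time spent is the sum of n geometric durations. The sketch below writes out that probability mass function; with n = 1 it reduces to the geometric duration of a single regular HMM state. The value p = 0.9 is an arbitrary choice for illustration.

```python
import math


def negative_binomial_duration(d, n, p):
    """P(total duration = d) for n chained states with self-loop probability p."""
    if d < n:
        return 0.0  # each state is visited for at least one step
    return math.comb(d - 1, n - 1) * (1 - p) ** n * p ** (d - n)


p = 0.9
geometric = [negative_binomial_duration(d, 1, p) for d in range(1, 6)]
three_states = [negative_binomial_duration(d, 3, p) for d in range(1, 6)]
print(geometric[0] > geometric[1])  # True: geometric always peaks at d = 1
print(three_states[:2])             # [0.0, 0.0]: durations below n are impossible
```

The geometric distribution always favors the shortest length, while the n = 3 version peaks away from the minimum, which is a better match for real gene and exon lengths.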
26 GENSCAN's hidden weapon
- C+G content is correlated with
  - Gene content ()
  - Mean exon length ()
  - Mean intron length ()
- These quantities affect parameters of model
- Solution
  - Train parameters of model in four different C+G content ranges!
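Selecting one of four parameter sets by C+G content can be sketched as a simple binning step. The cut points below (43%, 51%, 57%) are an assumption for illustration and should be checked against the GENSCAN paper before use; the function names are hypothetical.

```python
from bisect import bisect_right

# Assumed boundaries between the four C+G ranges (fractions, not percent).
GC_CUT_POINTS = [0.43, 0.51, 0.57]


def gc_content(seq):
    """Fraction of bases in seq that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def parameter_set_index(seq):
    """Index (0..3) of the parameter set trained for this C+G range."""
    return bisect_right(GC_CUT_POINTS, gc_content(seq))


print(parameter_set_index("ATATATATAT"))  # 0
print(parameter_set_index("ATGC"))        # 1
print(parameter_set_index("GGCC"))        # 3
```

The gene finder would then load the transition, duration, and emission parameters trained on sequences from the chosen range.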
27 Evaluation of Accuracy
[Figure: predicted vs. actual labels, Coding / No Coding]
(Slide by NF Samatova)
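Comparing predicted and actual Coding / No Coding labels base by base gives the usual nucleotide-level accuracy measures. The sketch below assumes the convention common in the gene-finding literature, where sensitivity is TP / (TP + FN) and "specificity" is TP / (TP + FP); labels are encoded as strings of "C" (coding) and "N" (no coding), which is an illustrative choice.

```python
def nucleotide_accuracy(actual, predicted):
    """Sensitivity and specificity from per-base 'C' / 'N' label strings."""
    tp = fp = fn = 0
    for a, p in zip(actual, predicted):
        if a == "C" and p == "C":
            tp += 1  # coding base predicted as coding
        elif a == "N" and p == "C":
            fp += 1  # non-coding base predicted as coding
        elif a == "C" and p == "N":
            fn += 1  # coding base that was missed
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tp / (tp + fp) if tp + fp else 0.0
    return sensitivity, specificity


sn, sp = nucleotide_accuracy("CCCCNNNN", "CCNNCNNN")
print(sn)  # 0.5
```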
28 Results of GENSCAN
- On the initial test dataset (Burset & Guigo)
  - 80% exact exon detection
  - 10% partial exons
  - 10% wrong exons
- In general
  - HMMs have been best in de novo prediction
  - In practice they overpredict human genes by 2x
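Exon-level results like the exact / partial / wrong breakdown above can be tallied by comparing predicted exon coordinates with an annotation. The sketch below assumes inclusive (start, end) coordinates and treats any overlap that is not an exact match as "partial"; the benchmark's own definition of a partial exon may be stricter, and the coordinates are made up.

```python
def classify_exons(predicted, annotated):
    """Count predicted exons that are exact, partial (overlapping), or wrong."""
    annotated_set = set(annotated)
    counts = {"exact": 0, "partial": 0, "wrong": 0}
    for start, end in predicted:
        if (start, end) in annotated_set:
            counts["exact"] += 1
        elif any(start <= a_end and a_start <= end for a_start, a_end in annotated):
            counts["partial"] += 1
        else:
            counts["wrong"] += 1
    return counts


annotated = [(100, 200), (300, 400), (500, 600)]
predicted = [(100, 200), (310, 400), (700, 800)]
print(classify_exons(predicted, annotated))
# {'exact': 1, 'partial': 1, 'wrong': 1}
```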