Title: Gene Structure and Identification
1Gene Structure and Identification
- Eukaryotic Genes and Genomes
- Gene Finding
Previous reading: 1.3, 9.1-9.6
Reading: 10.2, 10.4, 10.6-8
BIO520 Bioinformatics Jim Lund
2Complex Genome DNA
- 10% highly repetitive (300 Mbp)
- NOT GENES
- 25% moderately repetitive (750 Mbp)
- Some genes
- 25% exons and introns (800 Mbp)
- 40%?
- Regulatory regions
- Intergenic regions
3Eukaryotic Gene Expression
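[Figure: eukaryotic gene expression. Gene: enhancer, promoter, transcribed region, terminator. Transcription by RNA polymerase II gives the primary transcript (5' Exon1-Intron1-Exon2 3'). Capping, splicing, and cleavage/polyadenylation give the mature mRNA (7mG cap, An tail), which is transported and translated into a polypeptide (N to C).]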
4Yeast
Small ORFS (RNA genes) Regulatory Sequences
5Eukaryotes, cont'd
- Large eukaryotes
- introns common, LONGER than exons
- Promoter/enhancer
- genome sparse
- Fungi
- introns common, short relative to exons
- promoter/enhancer
- genome dense
6Intron Prevalence
[Plot: % of genes vs. introns]
7Intron Size
[Plot: % of genes vs. introns]
8Exon Size
[Plot: % of genes vs. exon size (bps)]
9Fungi
- Sew together exons
- ORF regions
- consensus sequences
- domain/polypeptide matches
10Exon/Intron Structure
CCACATTgtn(30-10,000)an(5-20)agCAGAA
...CCACATTCAGAA... ...ProHisSerGlu...
11Alternative Splice
CCACATTgtn(30-10,000)an(5-20)agcagAA
...CCACATTAA... ...ProHisSTOP
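The two splice outcomes on slides 10 and 11 can be reproduced with a short sketch. The intron filler bases, the helper names `splice` and `translate`, and the index arithmetic are illustrative assumptions; only the exon sequences and the two resulting mRNAs come from the slides.

```python
# Standard genetic code, codons ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}

def splice(pre_mrna, donor_start, acceptor_end):
    """Remove the intron pre_mrna[donor_start:acceptor_end]."""
    return pre_mrna[:donor_start] + pre_mrna[acceptor_end:]

def translate(mrna):
    """Translate complete codons from position 0; '*' marks a stop codon."""
    return "".join(CODON_TABLE[mrna[i:i + 3]] for i in range(0, len(mrna) - 2, 3))

# Hypothetical intron: GT, 30 filler bases, branch-point A, 5 filler bases, AG.
intron = "GT" + "C" * 30 + "A" + "C" * 5 + "AG"
pre_mrna = "CCACATT" + intron + "CAGAA"

donor = pre_mrna.index("GT")        # first GT is the donor site
first_ag = donor + len(intron)      # just past the intron's own AG
second_ag = first_ag + 3            # just past the AG inside the downstream CAG

normal = splice(pre_mrna, donor, first_ag)
alternative = splice(pre_mrna, donor, second_ag)

print(normal, translate(normal))            # CCACATTCAGAA PHSE
print(alternative, translate(alternative))  # CCACATTAA PH*
```

Using the downstream AG shifts the reading frame by two bases, which is why the alternative transcript runs straight into a stop codon.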
12Gene prediction targets
- Internal exons (donor-acceptor)
- Initial exons (5' end to donor)
- Terminal exons (acceptor to 3' end)
- Single exon genes (5' end to 3' end)
13Gene prediction
- Sequence based
- Consensus sites
- Signal sequences
- Homology
- Confirm prediction is a protein
- Known coding sequences
- cDNAs, SAGE
- Comparative analysis
- Identify exons, promoter/enhancer elements
14Codon Bias/Nucleotide Frequency-useful?
- High bias: high confidence
- Low bias: low confidence
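One rough way to put a number on codon bias is to count codons in a reading frame and measure how uneven the distribution is. The sketch below uses Shannon entropy for that; this is an illustrative choice, not a measure named in the lecture, and the function names are hypothetical.

```python
from collections import Counter
from math import log2

def codon_counts(seq, frame=0):
    """Count complete codons in the given reading frame (0, 1, or 2)."""
    seq = seq.upper()
    return Counter(seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))

def codon_entropy(seq, frame=0):
    """Shannon entropy (bits) of codon usage; lower means stronger bias."""
    counts = codon_counts(seq, frame)
    total = sum(counts.values())
    return -sum(n / total * log2(n / total) for n in counts.values())

biased = "GCTGCTGCTGCT"      # one alanine codon used every time
unbiased = "GCTGCCGCAGCG"    # all four alanine codons used equally
print(codon_entropy(biased), codon_entropy(unbiased))
```

A strongly biased ORF scores near zero, while even usage of four synonymous codons scores 2 bits. Short sequences give noisy estimates, which is in line with the slide's point that low bias means low confidence.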
15Finding Functional Sequences
- Known Consensus Sequences
- Consensus Sequence Generation
- Functional Tests
16Describing consensus sequences
- Position Weight Matrices
- Sequence Logos
- Hidden Markov Models
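As an illustration of the first description method, here is a minimal sketch of a log-odds position weight matrix built from aligned sites. The four example sites, the pseudocount of 1, and the uniform background of 0.25 are all assumptions chosen for the example.

```python
from math import log2

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds PWM (one dict per column) from equal-length sites."""
    total = len(sites) + 4 * pseudocount
    pwm = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        pwm.append({
            base: log2(((column.count(base) + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return pwm

def score(pwm, window):
    """Sum the log-odds scores of a window the same width as the PWM."""
    return sum(col[base] for col, base in zip(pwm, window))

def best_hit(pwm, seq):
    """Return the start index of the highest-scoring window in seq."""
    width = len(pwm)
    return max(range(len(seq) - width + 1),
               key=lambda i: score(pwm, seq[i:i + width]))

# Hypothetical aligned binding sites (not real data).
sites = ["TATAAA", "TATATA", "TATAAA", "TTTAAA"]
pwm = build_pwm(sites)
print(best_hit(pwm, "GGCGCTATAAAGGCG"))  # 5
```

The pseudocount keeps unseen bases from scoring minus infinity, so a window with one mismatch is penalized rather than excluded.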
17Translation Initiation Sites
18Splicing Consensus
Vertebrates: A64 G73 G T A62 A68 G84 T63 ... Y80 N Y80 Y87 R75 A Y95 ... C65 A G N N (numbers are % frequency at that position)
Fungi: GTRNGT (N)30-1000 CTRAC (N)5-15 YAG
Alternate Splicing!??
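The fungal consensus on this slide can be turned directly into a regular expression by expanding the IUPAC codes (R = A or G, Y = C or T, N = any base). The sketch below is one hedged way to do that; the test sequence is invented, and lazy quantifiers are an assumption made so that the nearest downstream branch site and acceptor are preferred.

```python
import re

IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "[AG]", "Y": "[CT]", "N": "[ACGT]"}

def iupac_to_regex(motif):
    """Expand an IUPAC motif into a regular expression fragment."""
    return "".join(IUPAC[ch] for ch in motif)

# GTRNGT (N)30-1000 CTRAC (N)5-15 YAG
FUNGAL_INTRON = re.compile(
    iupac_to_regex("GTRNGT") + "[ACGT]{30,1000}?"
    + iupac_to_regex("CTRAC") + "[ACGT]{5,15}?"
    + iupac_to_regex("YAG")
)

def find_introns(seq):
    """Return (start, end) for each non-overlapping candidate intron."""
    return [(m.start(), m.end()) for m in FUNGAL_INTRON.finditer(seq.upper())]

# Invented sequence: exon, candidate intron, exon.
seq = "ATGGCC" + "GTATGT" + "C" * 40 + "CTAAC" + "T" * 8 + "TAG" + "GCCTAA"
print(find_introns(seq))  # [(6, 68)]
```

A consensus match is only a candidate: real sequences contain many chance matches, and a single regular expression cannot express the alternative splicing raised on the slide.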
19Linguistic approach to combining gene features
- Non-repetitive DNA!!
- Long ORF
- similar to known protein
- ORF extended by reasonable splices
- ORF begins with good ATG
- Promoter/terminator flanks
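The "long ORF" and "ORF begins with good ATG" criteria can be sketched as a simple scan for ATG-to-stop stretches on both strands. The function names and the `min_codons` threshold are hypothetical, and coordinates on the minus strand refer to the reverse-complemented sequence.

```python
STOPS = {"TAA", "TAG", "TGA"}

def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]

def find_orfs(seq, min_codons=3):
    """Return (strand, start, end, codons) for ATG...stop ORFs, longest first.

    The codon count includes the stop codon. Minus-strand coordinates are
    positions on the reverse-complemented sequence.
    """
    orfs = []
    for strand, s in (("+", seq.upper()), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i                      # first ATG opens the ORF
                elif start is not None and codon in STOPS:
                    n_codons = (i + 3 - start) // 3
                    if n_codons >= min_codons:
                        orfs.append((strand, start, i + 3, n_codons))
                    start = None                   # look for the next ORF
    return sorted(orfs, key=lambda orf: orf[3], reverse=True)

print(find_orfs("CCATGAAACCCGGGTAACC"))  # [('+', 2, 17, 5)]
```

This covers only the ORF criteria; repeat masking, protein similarity, splice extension, and flanking promoter/terminator signals would each need their own evidence source.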
20DATABASE SEARCH
- BLASTN
- DNA-DNA comparison (ALWAYS!)
- Not sensitive (DNA conservation low)
- BLASTX/TBLASTX
- BLASTX: 6-frame ORFs vs. polypeptide database
- TBLASTX: 6 frames vs. 6 frames of a DNA database
www.ncbi.nlm.nih.gov
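The six-frame idea behind the translated searches can be sketched as follows. This illustrates the concept only and is not how the BLAST programs are implemented; the frame labels (+1 to -3) and helper names are assumptions.

```python
# Standard genetic code, codons ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}

def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]

def translate(seq):
    """Translate complete codons; unknown codons (e.g. with N) become 'X'."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))

def six_frame_translations(dna):
    """Return a dict of frame label -> protein string for all six frames."""
    dna = dna.upper()
    frames = {}
    for label, strand in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{label}{offset + 1}"] = translate(strand[offset:])
    return frames

frames = six_frame_translations("ATGGCCTAA")
print(frames["+1"], frames["-1"])  # MA* LGH
```

Comparing at the protein level is more sensitive than DNA-DNA comparison because synonymous codon changes disappear after translation.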
21Protein Database Matches
- Very helpful for the known
- What about the unknown???
22Transcript Initiation
- Basal Promoters
- Enhancers/Silencers/Regulatory Sites
- Boundary elements?
- Transcription Initiation
Prokaryotes vs Eukaryotes Organism-to-Organism
23Basal Promoter Analysis
Myers and Maniatis, Genes VI, 831
- TATA-box: -25 to -30, TBP
- CCAAT-box: -212 to -57, CTF/NF1
- GC-box: -164 to +1, SP1
- Cap signal (KCWKYYYY): +1 to +5
1
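A basal promoter scan can be sketched by expanding IUPAC motifs into regular expressions and reporting match positions relative to an assumed transcription start. The `TATAAA` motif, the test sequence, and the zero-based offset convention (offset 0 is the assumed start base, with no correction for the missing position 0 in biological numbering) are assumptions made for illustration; only the `KCWKYYYY` cap signal comes from the slide.

```python
import re

IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "K": "[GT]", "W": "[AT]", "Y": "[CT]"}

def iupac_to_regex(motif):
    """Expand an IUPAC motif into a regular expression fragment."""
    return "".join(IUPAC[ch] for ch in motif)

def motif_offsets(seq, motif, tss_index):
    """Return start offsets of all (overlapping) motif matches.

    Offsets are relative to tss_index, the assumed transcription start:
    0 is the start base itself and negative values are upstream.
    """
    pattern = re.compile("(?=" + iupac_to_regex(motif) + ")")
    return [m.start() - tss_index for m in pattern.finditer(seq.upper())]

# Invented promoter: filler, a TATA-like motif, filler, a cap-signal match.
seq = "GC" * 10 + "TATAAA" + "GC" * 12 + "TCAGTCTT" + "GCGC"
tss = 50  # assumed index of the transcription start in seq

print(motif_offsets(seq, "TATAAA", tss))    # [-30]
print(motif_offsets(seq, "KCWKYYYY", tss))  # [0]
```

Exact motif matching is the crudest option; a weight matrix per element, restricted to the position windows listed on the slide, would score near-matches as well.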
24Basal Promoter Analysis
Cao and Moi, Ped Res 51:415-421 (2002)
25mRNA processing
- Exon/Intron
- Alternate splicing
- Polyadenylation/Cleavage
- Stability
26PolyA sites
- Metazoans
- AATAAA, ATTAAA
- 15-20 bp 5' of the polyA addition site.
- YGTGTTYY (diffuse GT-rich sequence)
- 3' UTR of 100-700 bp typical.
- Yeast-different
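The metazoan rule can be sketched as a scan for the two hexamers, followed by a predicted window for the polyA addition site. Measuring the 15-20 bp gap from the end of the hexamer is an assumption, as are the function name and the test sequence.

```python
import re

# Lookahead so that overlapping signals are all reported.
SIGNALS = re.compile("(?=(AATAAA|ATTAAA))")

def polya_candidates(utr, min_gap=15, max_gap=20):
    """Return (signal_start, signal, (window_start, window_end)) tuples.

    The window is where the polyA addition site would be expected,
    min_gap to max_gap bases past the end of the hexamer (assumed).
    """
    utr = utr.upper()
    hits = []
    for m in SIGNALS.finditer(utr):
        signal_end = m.start() + 6
        lo, hi = signal_end + min_gap, signal_end + max_gap
        if lo <= len(utr):  # skip signals too close to the end of the sequence
            hits.append((m.start(), m.group(1), (lo, min(hi, len(utr)))))
    return hits

utr = "C" * 10 + "AATAAA" + "C" * 30
print(polya_candidates(utr))  # [(10, 'AATAAA', (31, 36))]
```

The downstream GT-rich element is not checked here; requiring a `YGTGTTYY`-like match beyond the window would be a natural refinement.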
27Translation
- Initiation site
- 1st AUG used 95% of the time.
- Translational regulatory elements
- translational enhancers
- upstream ORFs
28Tools-WWW
- Genscan
- Genie
- GRAIL II integrated gene parsing
- GenLang
- HMMGene (lock ESTs, etc.)
- GENEMARK
29Hidden Markov Models
- Probabilistic Models
- Applicable to linear sequences
- P(all states) = 1; infer probabilities of all states from the observed (hidden states are unobserved)
- Work best when local correlations are unimportant
- Gene finding, phylogeny, secondary structure, genetic mapping
- Parameters are set using a training set of gene annotations
- Quantitative probabilities
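To make the idea concrete, here is a toy two-state HMM (N = noncoding, C = coding) decoded with the Viterbi algorithm. Every probability below is invented for illustration; a real gene finder would estimate them from a training set and use many more states.

```python
from math import log

STATES = ("N", "C")
START = {"N": 0.5, "C": 0.5}
# Each row sums to 1 (hypothetical values).
TRANS = {"N": {"N": 0.9, "C": 0.1}, "C": {"N": 0.1, "C": 0.9}}
EMIT = {"N": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
        "C": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1}}

def viterbi(seq):
    """Return the most probable hidden state path for seq (log space)."""
    scores = [{s: log(START[s]) + log(EMIT[s][seq[0]]) for s in STATES}]
    back = []
    for base in seq[1:]:
        prev = scores[-1]
        col, ptr = {}, {}
        for s in STATES:
            # Best predecessor state for landing in state s at this base.
            best = max(STATES, key=lambda p: prev[p] + log(TRANS[p][s]))
            col[s] = prev[best] + log(TRANS[best][s]) + log(EMIT[s][base])
            ptr[s] = best
        scores.append(col)
        back.append(ptr)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: scores[-1][s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return "".join(reversed(path))

print(viterbi("ATATATGCGCGCGCATATAT"))  # NNNNNNCCCCCCCCNNNNNN
```

The observed bases are the sequence; the coding/noncoding labels are the hidden states that the model infers.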
30Accuracy Assessment
- PP = predicted coding
- AP = real positive
- TP = number correct positive
- TN = number correct negative
- FP = number false positive
- FN = number false negative
Sensitivity: $Sn = TP/AP$
Specificity: $Sp = TP/PP$
Approximate Correlation (AC):
$$AC = \frac{1}{2}\left(\frac{TP}{TP+FN} + \frac{TP}{TP+FP} + \frac{TN}{TN+FP} + \frac{TN}{TN+FN}\right) - 1$$
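The measures on this slide can be computed per base from a predicted and a real annotation. In the sketch below both annotations are strings of `1` (coding) and `0` (noncoding); that encoding, the function name, and the example strings are assumptions, and the code assumes every denominator is nonzero.

```python
def accuracy_measures(predicted, actual):
    """Return (Sn, Sp, AC) from two equal-length strings of '0' and '1'."""
    pairs = list(zip(predicted, actual))
    tp = sum(p == "1" and a == "1" for p, a in pairs)
    tn = sum(p == "0" and a == "0" for p, a in pairs)
    fp = sum(p == "1" and a == "0" for p, a in pairs)
    fn = sum(p == "0" and a == "1" for p, a in pairs)
    sn = tp / (tp + fn)   # TP / AP
    sp = tp / (tp + fp)   # TP / PP
    ac = (tp / (tp + fn) + tp / (tp + fp)
          + tn / (tn + fp) + tn / (tn + fn)) / 2 - 1
    return sn, sp, ac

actual    = "0011110000"
predicted = "0001111000"
sn, sp, ac = accuracy_measures(predicted, actual)
print(sn, sp, round(ac, 4))  # 0.75 0.75 0.5833
```

Here the prediction is shifted by one base, giving TP = 3, FP = 1, FN = 1, TN = 5. Note that specificity as defined on the slide is TP/PP, which differs from the TN-based specificity used in other fields.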
31Accuracy Levels
[Table: accuracy at the base-pair (Bp) and exon levels]
32NEXT
- Regulatory Sequences
- Known Consensus Sequences
- Consensus Sequence Generation
- Functional (Lab) Data
- Real examples