Title: Gene Structure and Identification
1Gene Structure and Identification
- Eukaryotic Genes and Genomes
- Gene Finding
Assigned reading: Ch 5. Previous reading: Ch 1, Ch 3, Ch 11, Ch 12
BIO520 Bioinformatics Jim Lund
2Complex Genome DNA
- 10% highly repetitive (300 Mbp)
- NOT GENES
- 25% moderate repetitive (750 Mbp)
- Some genes
- 25% exons and introns (800 Mbp)
- 40%?
- Regulatory regions
- Intergenic regions
3Eukaryotic Gene Expression
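[Figure: eukaryotic gene expression. Gene: enhancer, promoter, transcribed region, terminator. Transcription by RNA polymerase II gives the primary transcript (5' to 3': Exon1, Intron1, Exon2). Cap, splice, cleave/polyadenylate gives the mature mRNA (7mG cap, An tail). Transport, then translation to the polypeptide (N to C).]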
4Yeast
Small ORFs (RNA genes), regulatory sequences
5Eukaryotes, cont'd
- large Eukaryotes
- introns common, LONGER than exons
- Promoter/enhancer
- genome sparse
- Fungi
- introns common, short relative to exons
- promoter/enhancer
- genome dense
6Intron Prevalence
[Plot: fraction of genes (y-axis) by introns (x-axis)]
7Intron Size
[Plot: fraction of genes (y-axis) by introns (x-axis)]
8Exon Size
[Plot: fraction of genes (y-axis) by exon size in bp (x-axis)]
9Fungi
- Sew together exons
- ORF regions
- consensus sequences
- domain/polypeptide matches
10Exon/Intron Structure
CCACATT gt n(30-10,000) a n(5-20) ag CAGAA
...CCACATTCAGAA... = ...ProHisSerGlu...
11Alternative Splice
CCACATT gt n(30-10,000) a n(5-20) agcag AA
...CCACATTAA... = ...ProHisSTOP
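The two splice outcomes on slides 10 and 11 can be checked with a short sketch. It assumes a toy codon table limited to the codons in the example and a made-up intron filler (30 c's, an a, 5 t's) standing in for the variable-length stretches; `splice` and `translate` are illustrative helper names, not part of any gene-finding tool.

```python
# Toy codon table: only the codons needed for this example (a real
# translator needs all 64 codons).
CODONS = {"CCA": "Pro", "CAT": "His", "TCA": "Ser", "GAA": "Glu", "TAA": "STOP"}

def splice(pre_mrna, donor, acceptor_end):
    """Drop the intron pre_mrna[donor:acceptor_end] and join the exons."""
    intron = pre_mrna[donor:acceptor_end].lower()
    if not (intron.startswith("gt") and intron.endswith("ag")):
        raise ValueError("intron must start with gt and end with ag")
    return (pre_mrna[:donor] + pre_mrna[acceptor_end:]).upper()

def translate(cds):
    """Translate codon by codon, stopping at the first stop codon."""
    residues = []
    for i in range(0, len(cds) - len(cds) % 3, 3):
        residue = CODONS[cds[i:i + 3]]
        residues.append(residue)
        if residue == "STOP":
            break
    return residues

exon1 = "CCACATT"
# Hypothetical intron filler standing in for gt n(30-10,000) a n(5-20) ag.
intron = "gt" + "c" * 30 + "a" + "t" * 5 + "ag"
pre_mrna = exon1 + intron + "CAGAA"

donor = len(exon1)
acceptor = donor + len(intron)

normal = splice(pre_mrna, donor, acceptor)           # uses the first ag
alternative = splice(pre_mrna, donor, acceptor + 3)  # uses the ag of "CAG"

print(normal, translate(normal))
print(alternative, translate(alternative))
```

Shifting the acceptor by three bases removes "CAG" from the message, which changes the reading frame downstream of the junction and creates the in-frame TAA.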
12Gene prediction targets
- Internal exons (donor-acceptor)
- Initial exons (5'-donor)
- Terminal exons (acceptor-3')
- Single exon genes (5'-3')
13Gene prediction
- Sequence based
- Consensus sites
- Signal sequences
- Homology
- Confirm prediction is a protein
- Known coding sequences
- cDNAs, SAGE
- Comparative analysis
- Identify exons, promoter/enhancer elements
14Codon Bias/Nucleotide Frequency-useful?
- High bias high confidence
- Low bias low confidence
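One way to see why strong codon bias gives more confident calls is to score a candidate window against codon frequencies learned from known genes. The sketch below is a minimal illustration, not the method of any listed tool: the training sequence is invented, the 1/64 uniform background and the pseudo-frequency for unseen codons are assumptions, and `codon_frequencies` and `coding_score` are hypothetical names.

```python
import math
from collections import Counter

def codons_of(seq):
    """Split a sequence into complete in-frame codons."""
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]

def codon_frequencies(cds):
    """Relative frequency of each codon in a known coding sequence."""
    counts = Counter(codons_of(cds))
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}

def coding_score(window, freqs, unseen=1e-3):
    """Mean log2-odds of the window's codons versus uniform usage (1/64).

    `unseen` is an assumed pseudo-frequency for codons absent from training.
    """
    codons = codons_of(window)
    return sum(math.log2(freqs.get(c, unseen) * 64) for c in codons) / len(codons)

# Invented training sequence with a strong preference for GCT and GAA.
training = "GCTGCTGCTGCCGAAGAAGAAGAG"
freqs = codon_frequencies(training)

biased = coding_score("GCTGAAGCT", freqs)     # uses the preferred codons
unbiased = coding_score("TTTCCCGGG", freqs)   # uses codons never seen in training
print(round(biased, 2), round(unbiased, 2))
```

When the organism's genes use a few codons heavily, the gap between the two scores is wide; with flat codon usage the scores converge and the signal is weak.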
16Finding Functional Sequences
- Known Consensus Sequences
- Consensus Sequence Generation
- Functional Tests
17Consensus Inference
- Position Weight Matrices
- Sequence Logos
- Hidden Markov Models
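A position weight matrix can be sketched in a few lines. The example below assumes a handful of invented, pre-aligned TATA-like sites, a uniform background of 0.25 per base, and a pseudocount of 1; `build_pwm`, `score`, and `best_hit` are illustrative names.

```python
import math

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Log2-odds matrix, one {base: score} dict per aligned position."""
    total = len(sites) + 4 * pseudocount
    pwm = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        pwm.append({
            base: math.log2((column.count(base) + pseudocount) / total / background)
            for base in "ACGT"
        })
    return pwm

def score(pwm, window):
    """Sum the per-position log-odds for a window of the matrix width."""
    return sum(column[base] for column, base in zip(pwm, window))

def best_hit(pwm, seq):
    """Return (score, start) of the highest-scoring window in seq."""
    width = len(pwm)
    return max((score(pwm, seq[i:i + width]), i)
               for i in range(len(seq) - width + 1))

# Invented aligned sites, for illustration only.
sites = ["TATAAA", "TATAAA", "TATATA", "TATAAG"]
pwm = build_pwm(sites)
hit_score, hit_start = best_hit(pwm, "GGCGTATAAAGGC")
print(hit_start, round(hit_score, 2))
```

A sequence logo displays the same column frequencies graphically, and a profile HMM extends the matrix with insert and delete states.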
18Translation Initiation Sites
19Splicing Consensus
- Vertebrates: A64 G73 G T A62 A68 G84 T63 ... Y80 N Y80 Y87 R75 A Y95 ... C65 A G N N (numbers: % occurrence of the preceding base)
- Fungi: GTRNGT (N)30-1000 CTRAC (N)5-15 YAG
- Alternate splicing?
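The fungal consensus above translates directly into a regular expression once the IUPAC codes are expanded (R = A or G, Y = C or T, N = any base). This is a sketch under the assumption that the consensus is applied as a strict pattern; real splice sites vary, and the test sequence is invented.

```python
import re

# IUPAC codes used in the fungal consensus.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "[AG]", "Y": "[CT]", "N": "[ACGT]"}

def iupac_to_regex(pattern):
    """Expand an IUPAC string such as GTRNGT into a regex fragment."""
    return "".join(IUPAC[ch] for ch in pattern)

# GTRNGT (N)30-1000 CTRAC (N)5-15 YAG, with lazy gaps to prefer short introns.
FUNGAL_INTRON = re.compile(
    iupac_to_regex("GTRNGT") + "[ACGT]{30,1000}?"
    + iupac_to_regex("CTRAC") + "[ACGT]{5,15}?"
    + iupac_to_regex("YAG")
)

def find_introns(seq):
    """Return (start, end) spans of non-overlapping candidate introns."""
    return [(m.start(), m.end()) for m in FUNGAL_INTRON.finditer(seq.upper())]

# Invented example: exon, candidate intron, exon.
exon1 = "ATGGCC"
intron = "GTATGT" + "C" * 40 + "CTAAC" + "T" * 8 + "TAG"
exon2 = "GGCTAA"
print(find_introns(exon1 + intron + exon2))
```

Every span returned is only a candidate: the pattern says nothing about whether removing it leaves an open reading frame.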
20Linguistic Approach
- Non-repetitive DNA!!
- Long ORF
- similar to known protein
- ORF extended by reasonable splices
- ORF begins with good ATG
- Promoter/terminator flanks
21DATABASE SEARCH
- BLASTN
- DNA vs. DNA comparison (ALWAYS!)
- Not sensitive (DNA conservation low)
- BLASTX/TBLASTX
- BLASTX: 6-frame ORFs vs. polypeptide database
- TBLASTX: 6 frames vs. 6 frames of a DNA database
www.ncbi.nlm.nih.gov
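The six-frame translation that BLASTX and TBLASTX apply to a DNA query can be sketched as follows. This uses the standard genetic code and is only an illustration of the translation step, not of the search itself; `six_frames` is a hypothetical helper name.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def reverse_complement(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def translate(seq):
    """Translate complete codons; unknown codons (e.g. with N) become X."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))

def six_frames(seq):
    """Return translations keyed +1..+3 (forward) and -1..-3 (reverse)."""
    seq = seq.upper()
    frames = {}
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(strand_seq[offset:])
    return frames

for name, protein in six_frames("CCACATTCAGAA").items():
    print(name, protein)
```

Comparing at the protein level is what makes these searches more sensitive than BLASTN: synonymous DNA changes disappear after translation.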
22Protein Database Matches
- Very helpful for the known
- What about the unknown???
23Transcript Initiation
- Basal Promoters
- Enhancers/Silencers/Regulatory Sites
- Boundary elements?
- Transcription Initiation
Prokaryotes vs. Eukaryotes; Organism-to-Organism
24Basal Promoter Analysis
Myers and Maniatis, Genes VI, 831
- TATA-box: -25 to -30 (TBP)
- CCAAT-box: -212 to -57 (CTF/NF1)
- GC-box: -164 to +1 (SP1)
- K C W K Y Y Y Y: +1 to +5 (cap signal)
25Basal Promoter Analysis
Cao and Moi, Ped Res 51:415-421 (2002)
26mRNA processing
- Exon/Intron
- Alternate splicing
- Polyadenylation/Cleavage
- Stability
27PolyA sites
- Metazoans
- AATAAA, ATTAAA
- 15-20 bp 5' of the poly(A) addition site.
- YGTGTTYY (diffuse GT-rich sequence)
- 3' UTR of 100-700 bp typical.
- Yeast: different
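The metazoan rule can be turned into a simple scan: find each hexamer and report the window where the poly(A) addition site would be expected. The sketch assumes the 15-20 bp spacing is measured from the 3' end of the hexamer, which the slide does not specify; the UTR sequence is invented.

```python
import re

# Lookahead so that overlapping hexamers are all reported.
POLYA_SIGNAL = re.compile(r"(?=(AATAAA|ATTAAA))")

def polya_candidates(utr, min_gap=15, max_gap=20):
    """Return (signal_start, signal, (site_min, site_max)) for each hexamer.

    Assumption: the gap is counted from the base after the hexamer.
    """
    hits = []
    for match in POLYA_SIGNAL.finditer(utr.upper()):
        signal_end = match.start() + 6
        window = (signal_end + min_gap, signal_end + max_gap)
        hits.append((match.start(), match.group(1), window))
    return hits

# Invented 3' UTR fragment.
utr = "GCCTGAGG" + "AATAAA" + "G" * 30
print(polya_candidates(utr))
```

A fuller scan would also look for the GT-rich element downstream of the predicted site before accepting a hit.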
28Translation
- Initiation site
- 1st AUG used 95% of the time.
- Translational regulatory elements
- translational enhancers
- upstream ORFs
29Tools-WWW
- Genscan
- Genie
- GRAIL II integrated gene parsing
- GenLang
- HMMGene (lock ESTs, etc.)
- GENEMARK
30Hidden Markov Models
- Probabilistic Models
- Applicable to linear sequences
- P(all states) = 1; infer probabilities of all states from observed (hidden states unobserved)
- Work best when local correlations unimportant
- Genefinding, phylogeny, secondary structure, genetic mapping
- Parameters are set using a training set of gene annotations
- Quantitative probabilities
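A toy two-state HMM shows how hidden states are inferred from an observed sequence. The Viterbi sketch below assumes invented start, transition, and emission probabilities (GC-rich "coding", AT-rich "noncoding"); in a real gene finder these parameters would come from a training set, and the state model would be far richer.

```python
import math

STATES = ("coding", "noncoding")
# All probabilities below are made up for illustration, not trained.
START = {"coding": 0.5, "noncoding": 0.5}
TRANS = {"coding": {"coding": 0.9, "noncoding": 0.1},
         "noncoding": {"coding": 0.1, "noncoding": 0.9}}
EMIT = {"coding": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
        "noncoding": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35}}

def viterbi(seq):
    """Most probable state path for seq, computed in log space."""
    scores = [{s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}]
    backpointers = []
    for base in seq[1:]:
        current, pointers = {}, {}
        for state in STATES:
            # Best previous state for arriving in `state`.
            prev = max(STATES,
                       key=lambda p: scores[-1][p] + math.log(TRANS[p][state]))
            current[state] = (scores[-1][prev] + math.log(TRANS[prev][state])
                              + math.log(EMIT[state][base]))
            pointers[state] = prev
        scores.append(current)
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: scores[-1][s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]

seq = "ATATATAT" + "GCGCGCGC" + "TATATATA"
print(viterbi(seq))
```

With these numbers, an eight-base GC-rich stretch is long enough to outweigh the cost of two state switches, so the path enters and leaves the coding state at the composition boundaries.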
31Accuracy Assessment
PP = predicted coding, PN = predicted non-coding
AP = real positives, AN = real negatives
TP = number correct positive, TN = number correct negative
FP = number false positive, FN = number false negative
Sn = TP/AP, Sp = TP/PP
AC = ((TP/(TP+FN)) + (TP/(TP+FP)) + (TN/(TN+FP)) + (TN/(TN+FN))) / 2 - 1
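These measures are easy to compute from the four counts. The sketch below follows the definitions on this slide, with AP = TP + FN and PP = TP + FP; the counts in the example are invented, and the function assumes no denominator is zero.

```python
def accuracy(tp, tn, fp, fn):
    """Return (Sn, Sp, AC) from true/false positive/negative counts."""
    ap = tp + fn  # real positives
    pp = tp + fp  # predicted coding
    an = tn + fp  # real negatives
    pn = tn + fn  # predicted non-coding
    sn = tp / ap
    sp = tp / pp
    ac = (tp / ap + tp / pp + tn / an + tn / pn) / 2 - 1
    return sn, sp, ac

# Invented counts: 100 coding bases, 900 non-coding bases.
sn, sp, ac = accuracy(tp=80, tn=890, fp=10, fn=20)
print(round(sn, 3), round(sp, 3), round(ac, 3))
```

AC averages four conditional probabilities and rescales the result so that a perfect prediction scores 1.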
32Accuracy Levels
[Table: accuracy at the bp level and at the exon level]
33NEXT
- Regulatory Sequences
- Known Consensus Sequences
- Consensus Sequence Generation
- Functional (Lab) Data
- Real examples
34Gene Regulatory Sequences
- Functional sites
- Consensus
- Experimental tests
- Inferred sites
- Transcriptome analysis
35Regulatory Sites
- Transcript initiation
- mRNA processing
- Translation sites
36Regulatory Factors
- lacI, trpR, CAP, araC.
- GAL4, NDT80
Known from experiment. Infer from genome? Infer from expression data?
37EUKARYOTES
- More complex signals
- More genes
- More dispersed signals
- Combinatoric regulation common
38Enhancer Elements
- DNA element: Protein
- Octamer: OCT1, OCT2
- κB: NF-κB
- ATF: ATF
- AP1: AP1
- ...
39Consensus Sequence Databases
- WWW-based
- TFD (transcription factor database)
- BCM Search launcher
40Transcriptome Analyses
- Microarray transcription analysis
- Expression
- Transcription factor binding
- MEME analysis of clusters
More later....
41Practical Gene Finding
- Use ALL tools
- Comparative
- BLASTN, BLASTX
- Predictive (stitch together a consensus)
- HMM, GRAIL
- ORF finders
- Findpatterns (and WWW pattern searches)
- cDNA OR protein OR genetic evidence
42FRAMES-aldolase gene
43If aldolase is so tough, how do you really do it?
- Combine DNA sequence
- with other data!
44Genome-cDNA
[Figure: genome-cDNA comparison. Labels: cDNA, DNA sequencing, Align (GAP).]
45Comparative Genomics
- Conservation of coding regions
- Identification of transcription signals
- words in common
- Example-yeast comparisons