Title: Gene Structure and Identification


1
Gene Structure and Identification
  • Eukaryotic Genes and Genomes
  • Gene Finding

Previous reading: 1.3, 9.1-9.6
Reading: 10.2, 10.4, 10.6-8
BIO520 Bioinformatics, Jim Lund
2
Complex Genome DNA
  • 10% highly repetitive (300 Mbp)
    • NOT GENES
  • 25% moderately repetitive (750 Mbp)
    • Some genes
  • 25% exons and introns (800 Mbp)
  • 40%?
    • Regulatory regions
    • Intergenic regions

3
Eukaryotic Gene Expression
[Figure: eukaryotic gene expression]
  • Gene: Promoter, Transcribed Region (Exon1, Intron1, Exon2), Terminator, Enhancer
  • Transcription (RNA Polymerase II): primary transcript, 5' to 3'
  • Cap, Splice, Cleave/Polyadenylate: mRNA with 7mG cap and An tail
  • Transport
  • Translation: Polypeptide (N to C)
4
Yeast
  • ORFs = genes!

Small ORFs (RNA genes), regulatory sequences
5
Eukaryotes, cont'd
  • Large eukaryotes
    • introns common, LONGER than exons
    • promoter/enhancer
    • genome sparse
  • Fungi
    • introns common, short relative to exons
    • promoter/enhancer
    • genome dense

6
Intron Prevalence
[Chart: % of genes vs. introns]
7
Intron Size
[Chart: % of genes vs. introns]
8
Exon Size
[Chart: % of genes vs. exon size (bps)]
9
Fungi
  • Sew together exons
  • ORF regions
  • consensus sequences
  • domain/polypeptide matches

10
Exon/Intron Structure
CCACATT gt n(30-10,000) a n(5-20) ag CAGAA
...CCACATTCAGAA...
...ProHisSerGlu...
11
Alternative Splice
CCACATT gt n(30-10,000) a n(5-20) agcag AA
...CCACATTAA...
...ProHisSTOP
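
The two slides above show how moving the acceptor from the first "ag" to the "ag" three bases downstream changes the reading frame's outcome. A minimal sketch of that comparison follows. The intron length, the filler bases, and the helper names (`acceptor_ends`, `splice`, `translate`) are assumptions made for illustration, and the codon table is deliberately partial: it holds only the five codons this example needs.

```python
# Toy gene built from the slide's pattern: exon | gt ... a ... ag | exon.
# Intron length and filler bases are assumptions chosen for illustration.
EXON1 = "CCACATT"
INTRON = "gt" + "c" * 30 + "a" + "c" * 10 + "ag"
EXON2 = "CAGAA"
GENOMIC = EXON1 + INTRON + EXON2

# Partial codon table: only the codons that occur in this example.
CODONS = {"CCA": "Pro", "CAT": "His", "TCA": "Ser", "GAA": "Glu", "TAA": "STOP"}


def acceptor_ends(genomic, donor_start):
    """Return the end index of every 'ag' found downstream of the donor 'gt'."""
    seq = genomic.lower()
    ends = []
    pos = seq.find("ag", donor_start + 2)
    while pos != -1:
        ends.append(pos + 2)
        pos = seq.find("ag", pos + 1)
    return ends


def splice(genomic, donor_start, acceptor_end):
    """Remove genomic[donor_start:acceptor_end] and join the flanking exons."""
    return (genomic[:donor_start] + genomic[acceptor_end:]).upper()


def translate(mrna):
    """Translate in frame 0, stopping at the first stop codon."""
    peptide = []
    for i in range(0, len(mrna) - 2, 3):
        residue = CODONS[mrna[i:i + 3]]
        peptide.append(residue)
        if residue == "STOP":
            break
    return peptide


donor = GENOMIC.lower().find("gt")
variants = [splice(GENOMIC, donor, end) for end in acceptor_ends(GENOMIC, donor)]
for mrna in variants:
    print(mrna, translate(mrna))
```

The first variant reproduces ...CCACATTCAGAA... (Pro His Ser Glu); the second, which uses the downstream "ag", reproduces ...CCACATTAA... (Pro His STOP).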
12
Gene prediction targets
  • Internal exons (acceptor-donor)
  • Initial exons (5'-donor)
  • Terminal exons (acceptor-3')
  • Single exon genes (5'-3')

13
Gene prediction
  • Sequence based
    • Consensus sites
    • Signal sequences
  • Homology
    • Confirm prediction is a protein
  • Known coding sequences
    • cDNAs, SAGE
  • Comparative analysis
    • Identify exons, promoter/enhancer elements

14
Codon Bias/Nucleotide Frequency-useful?
  • High bias: high confidence
  • Low bias: low confidence
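
One simple way to look at codon bias is to tabulate codon counts for a candidate ORF and summarize them with a single number. The sketch below uses GC content at the third codon position (GC3) as that summary. GC3 is a choice made here for illustration, not a measure named on the slide, and the ORF is made up.

```python
from collections import Counter


def codon_counts(orf):
    """Count in-frame codons, ignoring any trailing partial codon."""
    orf = orf.upper()
    usable = len(orf) - len(orf) % 3
    return Counter(orf[i:i + 3] for i in range(0, usable, 3))


def gc3(counts):
    """Fraction of codons whose third base is G or C."""
    total = sum(counts.values())
    gc = sum(n for codon, n in counts.items() if codon[2] in "GC")
    return gc / total if total else 0.0


# Made-up ORF: ATG GCC GCC GCG CTG CTG TAA
orf = "ATGGCCGCCGCGCTGCTGTAA"
counts = codon_counts(orf)
print(counts.most_common())
print(round(gc3(counts), 3))
```

Comparing such a profile against one computed from known genes of the same organism is what gives the bias its predictive value; a single ORF's numbers mean little on their own.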

15
Finding Functional Sequences
  • Known Consensus Sequences
  • Consensus Sequence Generation
  • Functional Tests

16
Describing consensus sequences
  • Position Weight Matrices
  • Sequence Logos
  • Hidden Markov Models
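
A position weight matrix can be sketched in a few lines: count the bases in each column of a set of aligned sites, convert the counts to log-odds against a background, and score windows of a sequence by summing the column scores. The aligned sites, the pseudocount of 1, and the uniform background of 0.25 below are all assumptions for illustration.

```python
import math


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log2-odds matrix (one dict per column) from equal-length sites."""
    pwm = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        scores = {}
        for base in "ACGT":
            freq = (column.count(base) + pseudocount) / (len(sites) + 4 * pseudocount)
            scores[base] = math.log2(freq / background)
        pwm.append(scores)
    return pwm


def score(pwm, window):
    """Sum the per-position log-odds for a window as wide as the matrix."""
    return sum(column[base] for column, base in zip(pwm, window))


def best_hit(pwm, seq):
    """Return the start index of the highest-scoring window in seq."""
    width = len(pwm)
    return max(range(len(seq) - width + 1), key=lambda i: score(pwm, seq[i:i + width]))


# Made-up aligned sites resembling a TATA-like element.
sites = ["TATAAA", "TATAAT", "TATATA", "TTTAAA"]
pwm = build_pwm(sites)
seq = "GGCGCTATAAAGGCGC"
print(best_hit(pwm, seq), round(score(pwm, "TATAAA"), 2))
```

The pseudocount keeps unseen bases from producing a score of negative infinity, which matters when the training set is as small as this one.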

17
Translation Initiation Sites
18
Splicing Consensus
Vertebrates: A64 G73 G T A62 A68 G84 T63   Y80 N Y80 Y87 R75 A Y95 C65 A G N N
Fungi: GTRNGT (N)30-1000 CTRAC (N)5-15 YAG
Alternate Splicing!??
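
The fungal consensus above translates directly into a regular expression once the IUPAC codes are expanded (R = A or G, Y = C or T, N = any base). The sketch below does that expansion and searches a made-up sequence. The helper name `to_regex`, the test sequence, and the use of non-greedy spacers are assumptions for illustration.

```python
import re

# Only the IUPAC codes used by the consensus on the slide.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "[AG]", "Y": "[CT]", "N": "[ACGT]"}


def to_regex(consensus):
    """Expand an IUPAC consensus string into a regex fragment."""
    return "".join(IUPAC[code] for code in consensus)


# GTRNGT (N)30-1000 CTRAC (N)5-15 YAG, with non-greedy spacers.
FUNGAL_INTRON = re.compile(
    to_regex("GTRNGT")
    + "[ACGT]{30,1000}?"
    + to_regex("CTRAC")
    + "[ACGT]{5,15}?"
    + to_regex("YAG")
)

# Made-up sequence: exon, donor, spacer, branch site, spacer, acceptor, exon.
seq = "ATGGCC" + "GTAAGT" + "C" * 40 + "CTAAC" + "T" * 10 + "TAG" + "GCC"
match = FUNGAL_INTRON.search(seq)
print(match.span(), len(match.group()))
```

A regex like this treats every position as all-or-nothing, so it will miss introns that deviate from the consensus at even one base; a weight matrix or HMM is the usual remedy.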
19
Linguistic approach to combining gene features
  • Non-repetitive DNA!!
  • Long ORF
  • similar to known protein
  • ORF extended by reasonable splices
  • ORF begins with good ATG
  • Promoter/terminator flanks

20
DATABASE SEARCH
  • BLASTN
    • DNA-DNA comparison (ALWAYS!)
    • Not sensitive (DNA conservation low)
  • BLASTX/TBLASTX
    • BLASTX: 6-frame ORFs vs. polypeptide database
    • TBLASTX: 6 frames vs. 6 frames of a DNA database
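
The six-frame translation that BLASTX and TBLASTX apply to a DNA query can be sketched as follows: three offsets on the forward strand and three on the reverse complement. The function names and the example sequence are assumptions for illustration, the standard genetic code is assumed, and the input is expected to contain only A, C, G, and T.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate_frame(dna, offset):
    """Translate dna starting at offset; '*' marks a stop codon."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(offset, len(dna) - 2, 3))


def six_frames(dna):
    """Return translations for frames +1..+3 and -1..-3."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate_frame(dna, offset)
        frames[f"-{offset + 1}"] = translate_frame(rc, offset)
    return frames


frames = six_frames("ATGGCCTAA")
for name, peptide in sorted(frames.items()):
    print(name, peptide)
```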

www.ncbi.nlm.nih.gov
21
Protein Database Matches
  • Very helpful for the known
  • What about the unknown???

22
Transcript Initiation
  • Basal Promoters
  • Enhancers/Silencers/Regulatory Sites
  • Boundary elements?
  • Transcription Initiation

Prokaryotes vs Eukaryotes Organism-to-Organism
23
Basal Promoter Analysis
Myers and Maniatis, Genes VI, 831
  • TATA-box (-25 to -30): TBP
  • CCAAT-box (-212 to -57): CTF/NF1
  • GC-box (-164 to +1): SP1
  • K C W K Y Y Y Y (+1 to +5): cap signal

24
Basal Promoter Analysis
Cao and Moi, Ped Res 51:415-421 (2002)
25
mRNA processing
  • Exon/Intron
  • Alternate splicing
  • Polyadenylation/Cleavage
  • Stability

26
PolyA sites
  • Metazoans
    • AATAAA, ATTAAA
    • 15-20 bps 5' of polyA addition site.
    • YGTGTTYY (diffuse GT-rich sequence)
    • 100-700 bps 3' UTR typical.
  • Yeast: different
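
A scan for the metazoan signals listed above might look like the sketch below: find each AATAAA or ATTAAA hexamer, then report the window 15-20 bp downstream where polyA addition would be expected. Measuring the 15-20 bp gap from the end of the hexamer is an assumption made here, as are the function names and the example UTR.

```python
import re

# Lookahead so that overlapping hexamers are all reported.
POLYA_SIGNAL = re.compile(r"(?=(AATAAA|ATTAAA))")


def polya_signals(utr):
    """Return (start, hexamer) for every polyA signal in the sequence."""
    return [(m.start(), m.group(1)) for m in POLYA_SIGNAL.finditer(utr.upper())]


def expected_addition_window(signal_start, min_gap=15, max_gap=20):
    """Return (first, last) candidate positions, counted from the hexamer's end."""
    signal_end = signal_start + 6
    return signal_end + min_gap, signal_end + max_gap


# Made-up 3' UTR fragment.
utr = "GCGC" + "AATAAA" + "C" * 18 + "GTGTTCC"
signals = polya_signals(utr)
for start, hexamer in signals:
    print(start, hexamer, expected_addition_window(start))
```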

27
Translation
  • Initiation site
    • 1st AUG used 95% of the time.
  • Translational regulatory elements
    • translational enhancers
    • upstream ORFs

28
Tools-WWW
  • Genscan
  • Genie
  • GRAIL II integrated gene parsing
  • GenLang
  • HMMGene (lock ESTs, etc.)
  • GENEMARK

29
Hidden Markov Models
  • Probabilistic Models
  • Applicable to linear sequences
  • P(all states) = 1; infer the probabilities of all
    states from the observed data (hidden states are unobserved)
  • Work best when local correlations are unimportant
  • Genefinding, phylogeny, secondary structure,
    genetic mapping
  • Parameters are set using a Training Set of gene
    annotations
  • Quantitative probabilities

30
Accuracy Assessment
PP = predicted coding
AP = real positive
TP = number correct positive
TN = number correct negative
FP = number false positive
FN = number false negative
Sensitivity: Sn = TP/AP
Specificity: Sp = TP/PP
Approximate Correlation: AC = ((TP/(TP+FN)) + (TP/(TP+FP)) + (TN/(TN+FP)) + (TN/(TN+FN))) / 2 - 1
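
These measures are easy to compute from the four counts. The sketch below assumes AP = TP + FN and PP = TP + FP, which is consistent with the terms of the AC formula; the function name and the example counts are made up for illustration.

```python
def accuracy(tp, tn, fp, fn):
    """Return (Sn, Sp, AC) as defined on the slide."""
    ap = tp + fn  # real positives
    pp = tp + fp  # predicted positives
    sn = tp / ap
    sp = tp / pp
    ac = (tp / (tp + fn) + tp / (tp + fp) + tn / (tn + fp) + tn / (tn + fn)) / 2 - 1
    return sn, sp, ac


# Made-up nucleotide-level counts for one prediction.
sn, sp, ac = accuracy(tp=80, tn=900, fp=20, fn=20)
print(round(sn, 3), round(sp, 3), round(ac, 3))
```

Note that Sp here is TP/PP, the convention used in gene-finding evaluations, rather than the TN-based specificity used in other fields. AC ranges from -1 to 1, with 1 for a perfect prediction.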
31
Accuracy Levels
[Table: accuracy at the Bp level and at the Exon level]
32
NEXT
  • Regulatory Sequences
  • Known Consensus Sequences
  • Consensus Sequence Generation
  • Functional (Lab) Data
  • Real examples