Title: Gene Structure and Identification
1Gene Structure and Identification
- Genes and Genomes
- ORFs and more
- Consensus Sequences
- Gene Finding
Reading: sections 1.3, 9.1-9.6
BIO520 Bioinformatics Jim Lund
2Gene
- The functional and physical unit of heredity
passed from parent to offspring. Genes are pieces
of DNA, and most genes contain the information
for making a specific protein.
3Gene-Informatics
- Genes are character strings embedded in much
larger strings called the genome. A gene usually
encodes a protein. Genes are composed of ordered
elements associated with the fundamental genetic
processes including transcription, splicing, and
translation.
4ACGT to Gene
- Cells recognize genes from DNA sequence.
5Genes
- Protein Coding
- RNA genes
- rRNA
- tRNA
- siRNA, miRNA, snRNA, snoRNA
6Genomes
- Genome seq. has only limited use by itself
- Markers, SNPs, etc.
- Functional annotation
- Identify proteins and their functions.
- And regulatory regions, etc.
- Parts list: a source for understanding all of
biology, ushering in the post-genomic age of
biology.
7Genomes
2002: Mus musculus, 2,500,000,000 bp
8Characteristics of Protein Coding Genes
- ORF
- long (usually >100 aa)
- matches known proteins -> likely a real gene
- Basal signals
- Transcription, splicing, translation
- Regulatory signals
- Depend on organism
- Prokaryotes vs Eukaryotes
- Vertebrate vs. fungi, e.g.
9Infer Gene Structure: Gene Model
- Promoter
- Strength
- Regulation
- mRNA
- Exons
- Splicing
- Stability
- ORF -> protein
10Genomes: Gene Content
E. coli: 4000 genes x 1 kbp/gene = 4 Mbp
Genome = 4 Mbp!
Gene-rich
11Genomes: Gene Content
Human: 26,755 genes x 2 kbp = 54 Mbp mRNA
Introns: 300 Mbp? Regulatory regions: 300 Mbp?
2344 Mbp???
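The back-of-the-envelope arithmetic on the two Gene Content slides can be checked with a few lines of Python. This is only a sketch using the round numbers quoted on the slides; the helper name `coding_mbp` is made up for illustration, and the human total is assumed to be the sum of the four figures on the slide.

```python
# Rough gene-content arithmetic using only the figures quoted on the slides.
BP_PER_MBP = 1_000_000

def coding_mbp(n_genes, kbp_per_gene):
    """Total length of all genes (or mRNAs) in Mbp."""
    return n_genes * kbp_per_gene * 1_000 / BP_PER_MBP

ecoli_genes = coding_mbp(4000, 1)       # compare with the 4 Mbp genome
human_mrna = coding_mbp(26_755, 2)      # the slide rounds this to 54 Mbp

# Assumption: the human total is the sum of the slide's four figures.
human_total = 54 + 300 + 300 + 2344

print(f"E. coli genes: {ecoli_genes:.1f} Mbp")
print(f"Human mRNA:    {human_mrna:.1f} Mbp "
      f"({100 * human_mrna / human_total:.1f}% of {human_total} Mbp)")
```

The contrast is the point: by these numbers E. coli genes account for essentially the whole genome, while human mRNA accounts for a few percent.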
12Complex Genome DNA
- 10% highly repetitive (300 Mbp)
- NOT GENES
- 25% moderately repetitive (750 Mbp)
- Some genes
- 10% exons and introns (340 Mbp)
- 55% ?
- Regulatory regions
- Intergenic regions
Hard!!
13Easy problem: Bacterial Gene Finding
- Dense Genomes
- Short intergenic regions
- Uninterrupted ORFs
- Conserved signals
- Abundant comparative information
- Complete Genomes
14E. coli genome
- 4415 genes
- Average distance between genes: 118 bp
- Average protein length: 318 aa
- 57 proteins longer than 1000 aa.
- 318 shorter than 100 aa.
- 2584 operons, 70% contain one gene.
- 1.5% repetitive DNA (mostly viral fragments).
15Prokaryotic Gene Expression
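[Figure: a promoter, cistrons 1, 2, ... N, and a terminator on the DNA. RNA polymerase transcribes them into a single mRNA (5' to 3'), and each cistron on the mRNA is translated into its own polypeptide (1, 2, 3), shown with N and C termini.]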
16Prokaryotic gene prediction
- ORFs
- Biased nucleotide distribution
- Periodicity of 3
- Codon bias (codon usage statistics)
- Often quantified with the Codon Adaptation Index (CAI).
- Signal sequences
- Homology
- Other biological info: for E. coli, partial
N-terminal protein sequences.
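One simple way to look at the biased nucleotide distribution and period-3 signal listed above is to measure GC content separately at codon positions 1, 2 and 3. The sketch below is illustrative only; the function name `gc_by_codon_position` and the choice of GC content as the statistic are assumptions, not the method used by any particular gene finder.

```python
def gc_by_codon_position(seq, frame=0):
    """Return the GC fraction at codon positions 1, 2 and 3.

    `frame` is the 0-based offset of the first codon. Trailing bases that
    do not fill a whole codon are ignored.
    """
    seq = seq.upper()[frame:]
    usable = len(seq) - len(seq) % 3
    fractions = []
    for pos in range(3):
        bases = seq[pos:usable:3]
        gc = sum(base in "GC" for base in bases)
        fractions.append(gc / len(bases) if bases else 0.0)
    return fractions

# Toy sequence: codons ATG GCC GCG
print(gc_by_codon_position("ATGGCCGCG"))
```

In a real coding region the three positions tend to differ from each other in the correct frame, while non-coding sequence shows no such position dependence.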
17Prokaryotic signal sequences
- Ribosome binding site (RBS)/Shine-Dalgarno element
- 3-9 purines complementary to the sequence at the 3' end
of the 16S rRNA in the small subunit of the ribosome.
- Located 4-7 bp 5' of the AUG.
- Promoter
- -35 consensus site (TTGACA)
- -10 consensus site (TATAAT)
- Signal peptides
- Regulatory protein binding sites (4 to 8 bp)
18ORFs
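P(ORF) = (61/64)^n
(chance that n random codons contain no stop; 61 of the 64 codons are not stops)
P(20) = (61/64)^{20} = 0.38
P(100) = 0.008
P(200) ≈ 10^{-4}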
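The probabilities on this slide can be reproduced directly. The sketch below assumes all 64 codons are equally likely and independent, which is the assumption implied by the 61/64 term; the function name `p_random_orf` is illustrative.

```python
def p_random_orf(n_codons):
    """Chance that n_codons consecutive random codons contain no stop codon.

    Assumes uniform, independent codons: 61 of the 64 are not stops.
    """
    return (61 / 64) ** n_codons

for n in (20, 100, 200):
    print(f"P({n}) = {p_random_orf(n):.2g}")
```

This is why a length cutoff works as a first filter: open frames of 100 codons or more are unlikely to arise by chance.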
19ORF finding tools
- VectorNTI
- Analyze/ORF
- TESTCODE (Fickett's)
- CodonPreference
- WWW tools
- ORF Finder (NCBI)
- BCM Search Launcher...
20ORFs in E. coli
[Figure: ORF map of E. coli sequence in all six reading frames: 1, 2, 3, -1, -2, -3.]
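A six-frame ORF scan of the kind shown on this slide can be sketched in a few lines. This is a minimal illustration, not how any of the listed tools work internally: it only accepts ATG starts and the three standard stop codons, and the function names are made up for the example.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def orfs_in_frame(seq, offset, min_codons):
    """Yield (start, end) for ATG...stop ORFs in one frame.

    `end` is exclusive and includes the stop codon; length is counted in
    codons from the ATG up to, but not including, the stop.
    """
    start = None
    for i in range(offset, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == "ATG":
            start = i
        elif start is not None and codon in STOPS:
            if (i - start) // 3 >= min_codons:
                yield start, i + 3
            start = None

def find_orfs(seq, min_codons=100):
    """Return (frame, start, end) tuples for frames 1, 2, 3, -1, -2, -3.

    Coordinates for minus frames are on the reverse-complement strand.
    """
    seq = seq.upper()
    hits = []
    for sign, strand in ((1, seq), (-1, reverse_complement(seq))):
        for offset in range(3):
            for start, end in orfs_in_frame(strand, offset, min_codons):
                hits.append((sign * (offset + 1), start, end))
    return hits

# Toy example: a 6-codon ORF in frame 3
toy = "CC" + "ATG" + "GCT" * 5 + "TAA" + "GG"
print(find_orfs(toy, min_codons=5))
```

The default `min_codons=100` follows the "usually >100 aa" rule of thumb from the earlier slide; lowering it finds short genes at the cost of many chance ORFs.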
21Codon Bias
- Genetic code degenerate
- Codon usage varies
- Organism to organism
- Gene to gene
- High bias correlates with high level expression
- Bias correlates with tRNA isoacceptors
- Change bias or tRNAs, change expression
22Codon Bias
Gly GGG 6 0.21
Gly GGA 6 0.17
Gly GGT 6 0.38
Gly GGC 6 0.24
23Codon Bias: Gene Differences
          GAL4   ADH1
Gly GGG   0.21   0
Gly GGA   0.17   0
Gly GGT   0.38   0.93
Gly GGC   0.24   0.07
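Fractions like those in the glycine table come from counting synonymous codons in a coding sequence. The sketch below shows the calculation for one codon family; the function name `codon_fractions` is illustrative, and the input is assumed to be an in-frame coding sequence.

```python
from collections import Counter

GLY_CODONS = ("GGG", "GGA", "GGT", "GGC")

def codon_fractions(cds, codons=GLY_CODONS):
    """Fraction of each codon within one synonymous family.

    `cds` is assumed to be in frame, starting at the first codon.
    """
    cds = cds.upper()
    counts = Counter(cds[i:i + 3] for i in range(0, len(cds) - 2, 3))
    total = sum(counts[c] for c in codons)
    if total == 0:
        return {c: 0.0 for c in codons}
    return {c: counts[c] / total for c in codons}

# Toy CDS: ATG GGT GGT GGT GGC TAA
print(codon_fractions("ATGGGTGGTGGTGGCTAA"))
```

A strongly skewed result, as in the ADH1 column, is the pattern the slides associate with highly expressed genes.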
24Nucleotide Bias
- Coding DNA vs non-Coding DNA
- often GC content higher than bulk
- Empirical statistics (Fickett's TESTCODE)
- Useful
- ORF matches typical organism bias
- ORF obscured by STOP codons: DNA sequence errors?
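The coding versus non-coding GC difference can be looked at with a sliding window. This is a rough sketch, not Fickett's TESTCODE statistic; the function names and the default window and step sizes are arbitrary choices for illustration.

```python
def gc_content(seq):
    """GC fraction of a sequence (0.0 for an empty string)."""
    seq = seq.upper()
    return sum(base in "GC" for base in seq) / len(seq) if seq else 0.0

def windowed_gc(seq, window=100, step=50):
    """Return (start, GC fraction) for each full window along seq."""
    return [(i, gc_content(seq[i:i + window]))
            for i in range(0, len(seq) - window + 1, step)]

# Toy example with a tiny window
print(windowed_gc("GGCCAATT", window=4, step=4))
```

Windows whose GC fraction stands out from the genome-wide value are candidates for a closer look with the ORF and codon-bias checks.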
25We found ORFs-now what?
- Work backwards
- Locate adjacent cistrons
- Locate RBS
- Locate promoter
- Locate terminator
- Locate regulatory sites
26Operon Structure
Promoter?
27Translation: Ribosome Binding Site, Shine-Dalgarno Site
Consensus: nnAGGAGGnnnnnATG
Consensus not always used. Example E. coli gene: nnAaGAGGnnnnATG
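Because real sites often deviate from the consensus, a search for Shine-Dalgarno sites usually has to tolerate mismatches. The sketch below looks for an AGGAGG-like hexamer 4-7 bp upstream of each ATG, using the spacing given on the signal sequences slide. The function names and the one-mismatch default are assumptions for illustration.

```python
SD = "AGGAGG"

def mismatches(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def find_rbs(seq, max_mismatch=1, spacing=range(4, 8)):
    """Return (sd_start, atg_start, n_mismatches) for each ATG that has an
    SD-like hexamer ending 4-7 bp upstream of it."""
    seq = seq.upper()
    hits = []
    atg = seq.find("ATG")
    while atg != -1:
        for gap in spacing:
            sd_start = atg - gap - len(SD)
            if sd_start < 0:
                continue
            mm = mismatches(seq[sd_start:sd_start + len(SD)], SD)
            if mm <= max_mismatch:
                hits.append((sd_start, atg, mm))
        atg = seq.find("ATG", atg + 1)
    return hits

# The two slide sequences, with each n filled in as C for the example
print(find_rbs("CCAGGAGGCCCCCATG"))   # exact consensus
print(find_rbs("CCAAGAGGCCCCATG"))    # one mismatch, as in the E. coli gene
```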
28Bacterial Promoter
-35: T82 T84 G78 A65 C54 A45
Spacer: 16-18 bp
-10: T80 A95 T45 A60 A50 T96
+1: (A,G)
(numbers: % of promoters with that base at that position)
Alternate sigma factors: CCCTTGAA.CCCGATNT
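A simple promoter scan looks for the -35 and -10 hexamers separated by a 16-18 bp spacer, allowing some mismatches because no position is fully conserved. The sketch below counts mismatches against TTGACA and TATAAT; it ignores the per-position percentages, and the function names and the two-mismatch default are assumptions for illustration.

```python
MINUS_35 = "TTGACA"
MINUS_10 = "TATAAT"

def mismatches(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def find_promoters(seq, max_mismatch=2, spacers=(16, 17, 18)):
    """Return (pos_35, pos_10, spacer, total_mismatches) for each -35/-10
    pair whose combined mismatches are within max_mismatch."""
    seq = seq.upper()
    hits = []
    for i in range(len(seq) - len(MINUS_35) + 1):
        mm35 = mismatches(seq[i:i + len(MINUS_35)], MINUS_35)
        for spacer in spacers:
            j = i + len(MINUS_35) + spacer
            box10 = seq[j:j + len(MINUS_10)]
            if len(box10) < len(MINUS_10):
                continue  # -10 box would run off the end of the sequence
            total = mm35 + mismatches(box10, MINUS_10)
            if total <= max_mismatch:
                hits.append((i, j, spacer, total))
    return hits

# Toy promoter: perfect boxes with a 17 bp spacer
toy = "GG" + "TTGACA" + "C" * 17 + "TATAAT" + "GG"
print(find_promoters(toy, max_mismatch=0))
```

A weighted score built from the per-position percentages would be a natural refinement, since a mismatch at a 95% position matters more than one at a 45% position.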
29Terminators
- Rho-independent
- Stem/loop (structural only)
- 3' U tail
- Rho-dependent
- C-rich
- G-poor
- loose consensus
30Difficulties in gene prediction
- Frame shifts
- sequencing errors
- Overlapping ORFs
- Rare (a few percent)
- Short ORFs
- Unusual genes
- bp composition
- signal sequences
31Programs for prokaryotic gene prediction
- Glimmer
- ORPHEUS
- GeneMark
- 90% sensitivity and specificity
- GENSCAN
- Vector NTI (ORF analysis)
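Sensitivity and specificity figures like the one quoted above come from comparing predicted genes with a trusted annotation. The sketch below assumes the convention common in gene-prediction benchmarks, where specificity is the fraction of predictions that are correct, TP / (TP + FP); the slides do not say which definition is meant. Genes are represented by any hashable identifier, and the function name is illustrative.

```python
def sensitivity_specificity(predicted, actual):
    """Compare two sets of gene identifiers.

    sensitivity = TP / (TP + FN): fraction of real genes that were found.
    specificity = TP / (TP + FP): fraction of predictions that are real
    (assumed gene-prediction convention; elsewhere this is called precision).
    """
    tp = len(predicted & actual)
    fn = len(actual - predicted)
    fp = len(predicted - actual)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tp / (tp + fp) if tp + fp else 0.0
    return sensitivity, specificity

# Toy example: 4 predictions, 5 real genes, 3 in common
print(sensitivity_specificity({1, 2, 3, 4}, {2, 3, 4, 5, 6}))
```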