Title: BCB 444/544 Introduction to Bioinformatics
1 BCB 444/544 - Introduction to Bioinformatics
Lecture 9: Gene Structure Prediction & Protein Function Prediction
9_Sept11
2 Assignments: Reading, Exercises & Homework 2
- Mon Sept 11
- Campbell & Heyer (CH) Chp 2.1, pp 34 - 59, DQs & MMs
- Re: Predicting Protein Function
- Read: Friedberg I, Harder T, Godzik A. (2006) JAFA: a protein function annotation meta-server. Nucleic Acids Res. 34 (Web Server issue): W379-81. PMID 16845030
- http://nar.oxfordjournals.org/cgi/content/full/34/suppl_2/W379
- Visit http://jafa.burnham.org/
- Wed Sept 13 & Fri Sept 15
- CH Chp 2.2, pp 59 - 83
- Also, DQs & MMs
- Homework 2 - Due Mon, Sept 18
3 Genome Sequence Acquisition & Analysis
- 2.1 How Are Genomes Sequenced?
- What Is Genomics?
- How Are Whole Genomes Sequenced?
- How Are Organisms Picked for Genome Sequencing?
- Math Minute 2.1: What Can You Learn from a Dot Plot?
- Math Minute 2.2: How Do You Find Motifs?
- Can We Predict Protein Functions from DNA Sequence?
- Math Minute 2.3: What Are "Positives" & What Do They Have to Do with E-values?
Chp 2 - Campbell & Heyer Companion Website
4 Genome Sequence Acquisition & Analysis
- 2.1 How Are Genomes Sequenced? - cont.
- What Shapes Are the Proteins?
- Does Structure Reveal Function?
- Why Do the Databases Contain So Many Partial Sequences?
- Which Sequencing Method Worked Better?
- Annotated Genomes Online
- How Many Proteins Can One Gene Make?
- Can the Genome Alter Gene Expression Without Changing the DNA Sequence?
- What Is the Fifth Base in DNA? Methyl-Cytosine
- Imprinting, Methylation, and Cancer
- SUMMARY 2.1
5 Isn't there something puzzling about information provided so far?
1st, Back to Chp 1: QUESTIONS?
Hmmm. What is the size of an "average" human chromosome?
Genome = 3 x 10^9 bp, divided among 23 (pairs of) chromosomes
"Average" human chromosome = ? x 10^6 bp (Mb)
130 Mb
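The 130 Mb answer can be checked with a few lines of arithmetic. This is only a sketch using the round numbers from the slide (3 x 10^9 bp, 23 chromosomes); the variable names are ours, and real chromosomes vary widely around this average.

```python
# Back-of-the-envelope check of the "average" human chromosome size.
# Both inputs are the round numbers quoted on the slide, not measured values.
GENOME_SIZE_BP = 3e9     # haploid human genome, ~3 x 10^9 bp
N_CHROMOSOMES = 23       # 23 (pairs of) chromosomes

average_bp = GENOME_SIZE_BP / N_CHROMOSOMES
average_mb = average_bp / 1e6    # 1 Mb = 10^6 bp

print(f"Average chromosome: about {average_mb:.0f} Mb")
```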
6 Chp 1 - Questions?
- Why do the authors focus on these point mutations when the patients' chromosomes have either big deletions or translocations near/in the dystrophin gene?
- We were told that certain cases of DMD are associated with these point mutations, which result in changes in protein sequence
- 54 L->R (Leu to Arg) >> drastic phenotype, DMD = Duchenne MD
- 168 A->D (Ala to Asp) >> less severe, BMD = Becker MD
- But, we are told that the dystrophin gene is huge!
- Gene: close to 1,000,000 bp long! (1 Megabase or 1 Mb)
- mRNA: 14,000 nt long
- Protein: dimer (or tetramer, in crystal structure, PDB 1DXX)
- How many kDa (kilodaltons)?
- Did "the doctor" really sequence the entire gene from each of these patients????
7 UPDATES re Chp 1 Questions? 1
- But, we are told that the dystrophin gene is huge!
- Gene: close to 1,000,000 bp long! (1 Megabase or 1 Mb)
- Actually bigger! > 2.5 Mb
- Spans 79 exons!
- Dystrophin gene is the largest human gene!
- Accounts for 0.1% of the human genome!
- mRNA: 14,000 nt long
- Actually 14.6 kb! (with 11 kb "coding" region)
- Protein? > 3,500 amino acids (aa's)
- Protein molecular weight? 427 kilodaltons (kDa)
- Complex! Several promoters & alternative splicing result in different proteins (in size & sequence) in different tissues
- Wikipedia: "Dystrophin has the longest gene known to date, measuring 2.4 megabases. Its gene's locus is Xp21 and has 79 exons spanning 2.5 Mb, produces an mRNA of 14.6 kb and a protein of over 3500 amino acid residues. The gene is so large it accounts for 0.1% of the human genome!"
(Don't worry - I checked data at ENTREZ Gene & OMIM)
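The dystrophin figures above can be cross-checked against each other. The sketch below uses only the slide's numbers plus one assumption that is not from the slides: a rule-of-thumb average of about 110 Da per amino acid residue. It gives roughly 0.08% of the genome (which rounds to the quoted 0.1%), about 3,666 codons from 11 kb of coding sequence (consistent with "> 3,500 aa"), and an estimate near 400 kDa, in the same range as the reported 427 kDa.

```python
# Sanity checks on the dystrophin numbers quoted on the slide.
GENE_BP = 2.5e6          # gene span, ~2.5 Mb
GENOME_BP = 3e9          # haploid human genome, ~3 x 10^9 bp
CODING_NT = 11_000       # ~11 kb "coding" region of the 14.6 kb mRNA
AVG_RESIDUE_DA = 110     # ASSUMPTION: rule-of-thumb mass per amino acid residue

fraction_percent = 100 * GENE_BP / GENOME_BP    # share of the genome, in %
n_codons = CODING_NT // 3                       # 3 nt per codon
approx_kda = n_codons * AVG_RESIDUE_DA / 1000   # 1 kDa = 1000 Da

print(f"Gene is {fraction_percent:.2f}% of the genome")
print(f"Coding region holds about {n_codons} codons")
print(f"Estimated protein mass: about {approx_kda:.0f} kDa")
```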
8 UPDATES re Chp 1 Questions? 2
- Did "the doctor" really sequence the entire gene from each of these patients????
- Probably not! But direct sequencing of the entire coding region has been reported this year
- Go to ENTREZ/PubMed & type PMID 16331671
- Also, several labs now do sequence the entire coding region, but many provide other diagnostic tests instead
- Go to OMIM, type DMD, then click on the Gene Tests link - from here there is lots of info - e.g., click on Testing
- Now, go back & click on Reviews - wow, terms are defined, too!
- Now, click on Educational Materials at top - a great glossary here!
- Note that Gene Reviews is a great dynamic online-only journal - try looking up another disease you find interesting
9 Questions?
2nd, Back to Friday's QUESTIONS?
- What are substitution matrices?
- 2 major types: PAM & BLOSUM
- Re: Assigned Reading for Chp 2 & Lab 3
- Math Minute 2.3 - pp. 46 & 47
- Note MISTAKE on p. 47
- Incorrect version:
- "BLOSUM45 for finding more closely related sequences; BLOSUM80 for finding more divergent proteins"
- Correct version:
- "BLOSUM45 for finding more divergent sequences; BLOSUM80 for finding more closely related proteins"
10 Substitution Matrices: PAM vs BLOSUM
- PAM = Point Accepted Mutation - relies on an "evolutionary model" based on observed differences in closely related proteins
- Model includes a rate for each type of sequence change
- Suffix number (n) reflects the amount of "time" passed: the rate of expected mutation if n% of amino acids had changed
- PAM1 - for less divergent sequences (shorter time)
- PAM250 - for more divergent sequences (longer time)
- BLOSUM = BLOck SUbstitution Matrix - based on aa substitutions observed in evolutionarily divergent proteins
- Doesn't rely on a specific evolutionary model
- Suffix number (n) reflects expected similarity: average % aa identity in the MSA from which the matrix was generated
- BLOSUM45 - for more divergent sequences
- BLOSUM80 - for less divergent sequences
See Substitution Matrix (Wikipedia)
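To make the PAM suffix concrete: a PAM-n mutation probability matrix is obtained by multiplying the PAM1 matrix by itself n times. The sketch below shows only that extrapolation step on a hypothetical two-letter alphabet with a 1% change per PAM unit; real PAM matrices are 20 x 20 and are normally used as log-odds scores, which this toy does not compute. The helper names `mat_mul` and `mat_pow` are ours.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(m, n):
    """Raise a square matrix to the n-th power by repeated squaring."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    base = [row[:] for row in m]
    while n > 0:
        if n & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        n >>= 1
    return result

# HYPOTHETICAL two-letter alphabet: each residue has a 1% chance of
# changing to the other letter per PAM unit of evolutionary "time".
toy_pam1 = [[0.99, 0.01],
            [0.01, 0.99]]
toy_pam250 = mat_pow(toy_pam1, 250)

# Probability that a residue is unchanged after 250 PAM units
print(f"{toy_pam250[0][0]:.4f}")
```

Because a position can change more than once (and change back), the unchanged fraction does not fall linearly with n; in this toy it levels off near 0.5, the chance level for two letters.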
11 Gene Prediction & Protein Function Prediction
- What is a gene? Segment of DNA, some of which is "structural," i.e., transcribed to give a functional RNA product, & some of which is "regulatory"
- Genes can encode:
- mRNA (i.e., for protein)
- other types of RNA (tRNA, rRNA, miRNA, etc.)
- Genes differ in eukaryotes vs prokaryotes (& archaea) - both structure & regulation
12 Eukaryotes vs Prokaryotes: Cells
- Typical human & bacterial cells drawn to scale
- Eukaryotic cells are characterized by membrane-bound compartments, importantly, a nucleus, which is absent in prokaryotes
Brown Fig 2.1
BIOS Scientific Publishers Ltd, 1999
13 Eukaryotes vs Prokaryotes: Genes & Genomes
- Genes & genomes in eukaryotes vs prokaryotes
- Have different structures and regulatory signals
- Eukaryotic genomes
- Are packaged in chromatin & sequestered in a nucleus
- Are larger and have multiple chromosomes
- Contain mostly non-protein-coding DNA (98-99%)
14 Eukaryotes vs Prokaryotes: Genes & Genomes
- Eukaryotic genes
- Are larger and more complex than in prokaryotes
- Contain introns that are spliced out to generate mature mRNAs
- Often undergo alternative splicing, giving rise to multiple RNAs
- Are transcribed by 3 different RNA polymerases (instead of 1, as in prokaryotes)
- In biology, statements such as this include an implicit "usually" or "often"
15 Eukaryotes vs Prokaryotes: Gene Regulation
- Primary level of control?
- Prokaryotes: Transcription
- Eukaryotes: Transcription is also important, but
- Expression is regulated at multiple levels
- e.g., RNA processing, transport, stability,
- protein processing, post-translational modification, localization, stability
- Recent discoveries: small RNAs (miRNA, siRNA) play very important regulatory roles in eukaryotes, often at post-transcriptional levels
16 Eukaryotic Gene Structure
- Genes are fragmented, containing non-protein-coding introns between the functional exons
17 Synthesis & Processing of Eukaryotic mRNA
Gene in DNA
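The splicing step of mRNA processing can be sketched as joining exon segments and discarding the introns between them. The transcript and exon coordinates below are hypothetical toy values (exons in upper case, introns in lower case beginning with gu and ending with ag); capping and polyadenylation are not modeled.

```python
def splice(pre_mrna, exons):
    """Join exon segments of a pre-mRNA, dropping the introns between them.

    exons: list of (start, end) pairs, 0-based, end-exclusive, in 5'->3' order.
    """
    return "".join(pre_mrna[start:end] for start, end in exons)

# HYPOTHETICAL toy transcript: three exons separated by two 12-nt introns
pre_mrna = "AUGGCC" + "guaaguuuucag" + "GAUUAC" + "guacgcuaacag" + "UGGUAA"
exons = [(0, 6), (18, 24), (36, 42)]

mature = splice(pre_mrna, exons)
print(mature)

# Alternative splicing: skipping exon 2 gives a different, shorter mRNA
skipped = splice(pre_mrna, [exons[0], exons[2]])
print(skipped)
```

Choosing a different subset of exons from the same gene is how one gene can give rise to more than one mRNA.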
18 cDNAs & ESTs
- cDNA libraries are important for determining gene structure & studying the regulation of gene expression
- Isolate RNA (always from a specific organism, region, and time point)
- Convert RNA to complementary DNA (with reverse transcriptase)
- Clone into cDNA vector
- Sequence the cDNA inserts
- Short cDNAs are called ESTs or Expressed Sequence Tags
- ESTs are strong evidence for genes
- Full-length cDNAs can be difficult to obtain
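In sequence terms, the "convert RNA to complementary DNA" step amounts to a reverse complement written in DNA letters. A minimal sketch, with a hypothetical toy mRNA and function names of our own choosing:

```python
# Base-pairing rules for copying an RNA template into DNA
RNA_TO_DNA_COMPLEMENT = {"A": "T", "U": "A", "G": "C", "C": "G"}

def first_strand_cdna(mrna):
    """Return first-strand cDNA (written 5'->3') for an mRNA written 5'->3'."""
    return "".join(RNA_TO_DNA_COMPLEMENT[base] for base in reversed(mrna))

def second_strand_cdna(mrna):
    """Return the sense-strand cDNA: the mRNA sequence with U replaced by T."""
    return mrna.replace("U", "T")

mrna = "AUGGCCGAUUAA"   # HYPOTHETICAL toy mRNA
print(first_strand_cdna(mrna))
print(second_strand_cdna(mrna))
```

The second strand is what usually appears in sequence databases, which is why EST and cDNA records read like the mRNA with T in place of U.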
19 UniGene: unique genes via ESTs
- Find UniGene at NCBI
- www.ncbi.nlm.nih.gov/UniGene
- UniGene clusters contain many ESTs
- UniGene data come from many cDNA libraries.
- When you look up a gene in UniGene, you can obtain information re: level & tissue distribution of expression
20 Gene Regulation
- Eukaryotes vs prokaryotes
- Prokaryotic operons & promoters
- Eukaryotic promoters & enhancers
- Eukaryotic transcription factors
- Promoters & enhancers
- What does an RNA polymerase "see"?
21 Prokaryotic Genes & Operons
- Genes with related functions are often clustered in operons (e.g., lac operon)
- Operons are transcriptionally regulated as a single unit - one promoter controls several proteins
- mRNAs produced are polycistronic - one mRNA encodes several proteins, i.e., there are multiple ORFs, each with AUG (START) & STOP codons
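The idea of several ORFs on one polycistronic mRNA can be illustrated with a simple scan for AUG followed by an in-frame stop codon. This is a minimal sketch on a hypothetical toy mRNA; it scans three frames of one strand only, and it ignores ribosome binding sites and alternative start codons that real prokaryotic gene finders take into account.

```python
STOP_CODONS = {"UAA", "UAG", "UGA"}

def find_orfs(mrna, min_codons=2):
    """Return (start, end) index pairs of ORFs: AUG ... in-frame stop codon.

    Scans all three reading frames of one strand. The end index is
    exclusive and includes the stop codon. AUGs inside an already open
    ORF are not reported separately.
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(mrna) - 2, 3):
            codon = mrna[i:i + 3]
            if start is None and codon == "AUG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)

# HYPOTHETICAL polycistronic mRNA carrying two short ORFs
mrna = "GG" + "AUGAAACCCUAA" + "GGAGG" + "AUGGGUUUUUGA" + "CC"
for start, end in find_orfs(mrna):
    print(start, end, mrna[start:end])
```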
22 Prokaryotic promoters
- RNA polymerase complex recognizes promoter sequences located very close to & on the 5' side (upstream) of the initiation site
- RNA polymerase complex binds directly to these, with no requirement for transcription factors
- Prokaryotic promoter sequences are highly conserved: -10 region & -35 region
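Because the -35 and -10 regions are so well conserved, a crude promoter search can be written as a consensus match. The sketch below uses the commonly cited E. coli consensus hexamers TTGACA (-35) and TATAAT (-10); the allowed spacer of 16-19 bp, the mismatch limit, and the test sequence are assumptions made for illustration, not values from the slides.

```python
MINUS_35 = "TTGACA"   # commonly cited E. coli consensus for the -35 region
MINUS_10 = "TATAAT"   # commonly cited consensus for the -10 region

def mismatches(a, b):
    """Count positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def find_promoters(dna, max_mismatch=1, spacers=range(16, 20)):
    """Return (pos_35, pos_10, total_mismatches) for candidate promoters.

    ASSUMPTIONS: up to max_mismatch differences from consensus per box,
    and 16-19 bp between the two boxes.
    """
    hits = []
    for i in range(len(dna) - 5):
        m35 = mismatches(dna[i:i + 6], MINUS_35)
        if m35 > max_mismatch:
            continue
        for gap in spacers:
            j = i + 6 + gap
            box = dna[j:j + 6]
            if len(box) < 6:
                continue    # ran off the end of the sequence
            m10 = mismatches(box, MINUS_10)
            if m10 <= max_mismatch:
                hits.append((i, j, m35 + m10))
    return hits

# HYPOTHETICAL sequence with a perfect promoter and a 17-bp spacer
dna = "CCGG" + "TTGACA" + "GCGCGCGCGCGCGCGCG" + "TATAAT" + "GCGCGCA"
print(find_promoters(dna))
```

Real promoters rarely match the consensus exactly, so practical tools score matches with weight matrices rather than counting mismatches.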
23 Promoter for prokaryotic RNA polymerase (e.g., in bacterium, E. coli)
Brown Fig 9.17
BIOS Scientific Publishers Ltd, 1999
24 Eukaryotic genes
- Genes with related functions are not usually clustered, but share common regulatory regions (promoters, enhancers, etc.)
- Chromatin structure must be right for transcription to occur
25 Eukaryotic genes have large & complex regulatory regions
Cis-acting regulatory elements include: promoters, enhancers, silencers
Trans-acting regulatory factors include: transcription factors (TFs), chromatin remodeling complexes, small RNAs
Brown Fig 9.17
BIOS Scientific Publishers Ltd, 1999
26 Eukaryotic Promoters & Enhancers
- Both promoters & enhancers are binding sites for transcription factors
- Promoters: located relatively close to the initiation site
- (but can be located within the gene, rather than upstream!)
- Enhancers: also required for regulated transcription
- (control expression in specific cell types, developmental stages, in response to environment, etc.)
- RNA polymerase complexes do not specifically recognize promoter sequences directly
- Transcription factors bind first and serve as landmarks for recognition by RNA polymerase complexes
27 Activators vs Repressors
Regions far from the promoter can act as "enhancers" or "repressors" of transcription by serving as binding sites for activator or repressor proteins (TFs)
[Figure labels: promoter, enhancer, gene, repressor; distance 100 - 50,000 bp]
Activator proteins (TFs) bind to enhancers & interact with RNAP to stimulate transcription
Repressor prevents binding of activator
Repressors block the action of activators
28 Eukaryotic regulatory regions are complex (often contain many different TF binding site motifs!!!)
Fig 9.13 Mount 2004
29 Eukaryotic genes are transcribed by 3 different RNA polymerases (regulatory regions & TFs differ, too)
[Figure labels: rRNA (RNA polymerase I); mRNA (RNA polymerase II); tRNA, 5S RNA (RNA polymerase III)]
Brown Fig 9.18
BIOS Scientific Publishers Ltd, 1999
30 Eukaryotic transcription factors
- Transcription factors (TFs) are DNA binding proteins that also interact with the RNA polymerase complex to activate or repress transcription
- TFs contain characteristic DNA binding motifs
- (these motifs are strings of amino acids in TF protein sequences)
- TFs recognize specific short DNA sequence motifs called transcription factor binding sites
- (these motifs are strings of nucleotides in DNA sequences)
- Several databases for these, e.g., TRANSFAC, JASPAR
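Binding-site motifs are usually represented as a matrix of base counts per position rather than a single string, and a sequence is scanned by scoring each window against that matrix. The sketch below is a minimal log-odds scan; the 4-bp count matrix, the pseudocount, and the uniform background are all hypothetical choices for illustration and do not correspond to any real TRANSFAC or JASPAR entry.

```python
import math

# HYPOTHETICAL counts for a 4-bp binding site, from 10 aligned example sites.
# Each list gives the count of that base at motif positions 1-4.
COUNTS = {
    "A": [8, 0, 1, 9],
    "C": [1, 0, 8, 0],
    "G": [0, 10, 0, 1],
    "T": [1, 0, 1, 0],
}
N_SITES = 10
MOTIF_LEN = 4
PSEUDOCOUNT = 0.5    # ASSUMPTION: avoids log(0) for bases never observed
BACKGROUND = 0.25    # ASSUMPTION: all four bases equally likely

def log_odds(base, pos):
    """Log2 ratio of motif frequency to background frequency."""
    freq = (COUNTS[base][pos] + PSEUDOCOUNT) / (N_SITES + 4 * PSEUDOCOUNT)
    return math.log2(freq / BACKGROUND)

def score_window(window):
    """Sum the per-position log-odds scores for one candidate site."""
    return sum(log_odds(base, pos) for pos, base in enumerate(window))

def best_site(dna):
    """Return (position, site) of the highest-scoring window in dna."""
    starts = range(len(dna) - MOTIF_LEN + 1)
    best = max(starts, key=lambda i: score_window(dna[i:i + MOTIF_LEN]))
    return best, dna[best:best + MOTIF_LEN]

position, site = best_site("TTTAGCATTT")
print(position, site)
```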
31 Zinc finger transcription factors
- Common in eukaryotic proteins
- 1% of mammalian genes encode zinc-finger proteins
- In C. elegans, there are 500!
- Can be used as highly specific DNA binding modules
- Potentially valuable tools for directed genome modification (esp. in plants) & human gene therapy
Brown Fig 9.12
BIOS Scientific Publishers Ltd, 1999
32 Gene Prediction
- Overview of steps & strategies
- What sequence signals can be used?
- What other types of information can be used?
- Algorithms
- HMMs, Bayesian models, neural nets
- Gene prediction software
- 3 major types
- many, many programs!
33 Overview of gene prediction strategies
What sequence signals can be used?
- Transcription: TF binding sites, promoter, initiation site, terminator, GC islands, etc.
- Processing signals: splice donor/acceptors, polyA signal
- Translation: start (AUG = Met) & stop (UGA, UAA, UAG), ORFs, codon usage
What other types of information can be used?
- Homology (sequence comparison, BLAST)
- cDNAs & ESTs (experimental data, pairwise alignment)
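Several of the signals listed above are short fixed strings, so a first-pass scan can be written with regular expressions. The sketch below looks for start codons, stop codons, the canonical GT/AG dinucleotides at intron ends, and the common AATAAA polyA signal in a hypothetical toy DNA sequence; the function name and test sequence are ours.

```python
import re

def positions(pattern, dna):
    """Return 0-based start positions of all (possibly overlapping) matches."""
    return [m.start() for m in re.finditer(f"(?={pattern})", dna)]

def find_signals(dna):
    """Locate a few simple sequence signals on the coding strand of dna."""
    return {
        "start_codon": positions("ATG", dna),
        "stop_codon": positions("TAA|TAG|TGA", dna),
        "splice_donor": positions("GT", dna),       # introns usually begin with GT
        "splice_acceptor": positions("AG", dna),    # and usually end with AG
        "polyA_signal": positions("AATAAA", dna),
    }

# HYPOTHETICAL toy gene: exon, 12-bp intron, exon, then a polyA signal
dna = "ATGGCC" + "GTAAGTTTTCAG" + "GATTAA" + "GCAATAAAGC"
signals = find_signals(dna)
for name, hits in signals.items():
    print(name, hits)
```

Even in this 34-bp toy, GT and AG each occur more often than the one real intron boundary, which is why gene prediction programs combine such signals with statistical models and other evidence instead of relying on any one of them.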