PROMOTER - PowerPoint PPT Presentation

About This Presentation
Title:

PROMOTER

Description:

PROMOTER – PowerPoint PPT presentation

Number of Views:85
Slides: 33
Provided by: kiransree
Category:
Tags:

less

Transcript and Presenter's Notes

Title: PROMOTER


1
Biological MotivationGene Finding in Eukaryotic
Genomes
  • Anne R. Haake
  • Rhys Price Jones

2
Recall from our previous discussion of gene
finding in prokaryotes
  • The major strategies in gene finding programs
    are to look for
  • Signals/Features
  • Content/Composition
  • Similarity to known genes (BLAST!)

3
3 Major Categories of Information used in Gene
Finding Programs
  • Signals/features a sequence pattern with
    functional significance e.g. splice donor
    acceptor sites, start and stop codons, promoter
    features such as TATA boxes, TF binding sites
  • Content/composition -statistical properties of
    coding vs. non-coding regions.
  • e.g. codon-bias length of ORFs in prokaryotes
    CpG islands GC content
  • Similarity-compare DNA sequence to known
    sequences in database
  • Not only known proteins but also ESTs, cDNAs

4
In Prokaryotic Genomes
  • We usually start by looking for an ORF
  • A start codon, followed by (usually) at least 60
    amino acid codons before a stop codon occurs
  • Or by searching for similarity to a known ORF
  • Look for basal signals
  • Transcription (the promoter consensus and the
    termination consensus)
  • Translation (ribosome binding site the
    Shine-Dalgarno sequence)
  • Look for differences in sequence content between
    coding and non-coding DNA
  • GC content and codon bias

5
The Complicating factors in Eukaryotes
  • Interrupted genes (split genes)
  • introns and exons
  • Large genomes
  • Most DNA is non-coding
  • introns, regulatory regions, junk DNA (unknown
    function)
  • About 3 coding
  • Complex regulation of gene expression
  • Regulatory sequences may be far away from start
    codon

6
Some numbers to consider
  • Vertebrate genes average about 30Kb long
  • varies a lot
  • Coding region is only about 1-2 Kb
  • Exon sizes and numbers vary a lot
  • Average is 6 exons, each about 150 bp long
  • An average 5 UTR is about 750 bp
  • An average 3UTR is about 450 bp
  • (both can be much longer)
  • There are huge deviations from all of these
    numbers
  • e.g. dystrophin is 2.4 Mb long factor VIII gene
    has 26 exons, introns are up to 32 Kb (one intron
    produces 2 transcripts unrelated to the gene!)
  • There are genes without introns called
    single-exon or intronless genes

7
Eukaryotic Gene Structure
www.bio.purdue.edu/courses/biol516/eukgenestructur
e.gif
8
Given a long eukaryotic DNA sequence
  • How would you determine if it had a gene?
  • How would you determine which substrings of the
    sequence contained protein-coding regions?

9
  • In prokaryotic genomes we usually start by
    looking for ORFs.
  • Is this a good approach for the eukaryotic
    genome?
  • Why or why not?

10
So, whats the problem with looking for ORFs?
  • split genes make it difficult to define ORFs
  • Where are the stops and stops?
  • What problems do introns introduce?
  • What would you predict for the size of ORFs?
  • (you cant with any certainty!)

11
Most Programs Concentrate on Finding Exons
  • Exon the region of DNA within a gene that codes
    for a polypeptide chain or domain
  • Intron non-coding sequences found in the
    structural genes

12
Splice Sites used to Define Exons
  • Splice donor (exon-intron boundary) and splice
    acceptor (intron-exon boundary)
  • are consensus sequences
  • A statistical determination of the
    patternapproximates the pattern
  • C(orA)AG/GTA(orG)AGT "donor" splice site
  • T(orC)nNC(orT)AG/G "acceptor" splice site

13
Gene finding programs look for different types of
exon
  • single exon genes begin with start codon end
    with stop codon
  • initial exons begin with start codon end with
    donor site
  • internal exons begin with acceptor end with
    donor
  • terminal exons begin with acceptor end with
    stop codon

14
How are correct splice sites identified?
  • There are many occurrences of GT or AG within
    introns that are not splice sites
  • Statistical profiles of splice sites are used

http//www.lclark.edu/lycan/Bio490/pptpresentatio
ns/mutation/sld016.htm
15
Other Biologically Important Signals Used in Gene
Finding Programs
  • Transcriptional Signals
  • Transcription Start characterized by cap signal
  • A single purine (A/G)
  • TATA box (promoter) at 25 relative to start
  • Polyadenylation signal AATAAA (3 end)
  • Major Caveat not all genes have these signals
  • Makes it difficult to define the beginning and
    end of a gene

16
Upstream Promoter Sites
  • Transcription Factor (TF) sites
  • Transcription factors are sequence-specific
    DNA-binding proteins
  • Bind to consensus DNA sequences
  • e.g. CAAT transcription factor and CAAT box
  • Many of these
  • Vary in sequence, location, interaction with
    other sites
  • Further complicates the problem of delineating a
    gene

17
Translation Signals
  • Kozak sequence
  • The signal for initiation of translation in
    vertebrates
  • Consensus is GCCACCatgG
  • And of course..
  • Translation stop codons

18
Codon Biasin Eukaryotic Genomes
  • Yeast Genome arg specified by AGA 48 of time
    (other five equivalent codons 10 each)
  • Fruitfly Genome arg specified by CGC 33 of time
    (other five 13 each)

19
GC Content in Eukaryotes
  • Overall GC content does not vary between species
    as it does in prokaryotes
  • GC content is still important in gene finding
    algorithms
  • CpG Islands
  • CG dinucleotides occur at low frequency overall
    in the genome
  • Exception CpG islands near promoters
  • CG dinucleotides occur at level predicted by
    chance
  • -1,500 to 500 (relative to transcription start
    site)

20
CpG Islands
  • Occurrence related to methylation
  • Methylation of C in CG dinucleotides
  • Methylation of C makes CpG prone to mutation
    (e.g. to TpG or CpA)
  • Level of methylation is low in actively
    transcribed genes
  • Transcription requires a methyl-free promoter

21
Gene Finding Strategies
  • Homology-based approach
  • Find sequences that are similar to known gene
    sequences
  • ab initio-based approach is to identify genes by
  • Signal sequences
  • Composition

22
List of Gene Finding Programs
  • http//www.hku.hk/bruhk/sggene.html

23
Homology-Based Approaches in Eukaryotic Genomes
  • More complicated than prokaryotes due to split
    genes
  • Genome sequence -gt first identify all candidate
    exons
  • Use a spliced alignment algorithm to explore all
    possible exon assemblies compare to known
  • e.g. Procrustes
  • Limitations
  • must have similar sequence in the database with
    known exon structure
  • Sensitive to frame shift errors

24
Procrustes
  • Gene Recognition via spliced alignment
  • Given a genomic sequence and a set of candidate
    exons, the spliced alignment algorithm explores
    all possible exon assemblies and finds a chain of
    exons with the best fit to a related target
    protein
  • http//hto-13.usc.edu/software/procrustes/salign

25
GenScan
  • Allows integration of multiple types of
    information
  • Earlier programs considered features of gene
    structure in isolation
  • Uses a generalized HMM (one state might use a
    weight matrix model, another an HMM)
  • http//genes.mit.edu/GENSCAN.html

26

GenScan
  • Probabilistic Model of Genes
  • Accounts for many of the known structural
    compositional properties of genes including
  • typical gene density
  • typical number of exons per gene
  • distribution of exon sizes for different types of
    exon
  • compositional properties of coding vs. non-coding
  • translation initiation (Kozak)
  • termination signals
  • TATA box, cap site and poly-adenylation signals
  • donor and acceptor splice sites

27
GenScan
  • Uses as a training set 238 multi-exon genes and
    142 single-exon genes from GenBank to compute
    parameters
  • Initial state probabilities
  • Transition probabilities
  • State length distributions

28
GenScan
  • Probabilistic models for the states
  • The states correspond to different functional
    units on a gene e.g promoter regions, exon
  • Transitions ensure that the order that the model
    marches through the states is biologically
    consistent
  • Length distributions take into account that
    different functional units have different lengths.

29
GenScan
  • Signal models used by GenScan
  • - WMM weight matrix model
  • for transcriptional and translational signals
    (translation initiation, polyadenylation signals,
    TATA box etc.) e.g. polyadenylation signal
    is modeled as a 6 bp WMM with AATAAA as the
    consensus sequence
  • -WAM weight array model assumes some
    dependencies between adjacent positions in the
    sequence
  • e.g. used for the pyrimidine-rich region and
    the splice acceptor site
  • -Maximal dependency decomposition
  • e.g. used for donor splice sites

30
GenScan
  • does not use similarity search
  • uses double stranded genomic sequence model
  • potential genes on both strands are analysed
    simultaneously
  • Limitations
  • cannot handle overlapping transcription unit
  • does not address alternative splicing

31
GRAIL
  • GRAIL (Gene Recognition and Assembly Internet
    Link)
  • uses a number of sensor algorithms to evaluate
    coding potential of a DNA sequence
  • features include 6-mer composition, GC
    composition and splice junction recognition
  • the output of the sensor algorithms is input to
    a neural network, which uses empirical data for
    training.

32
GRAIL-exp
  • http//compbio.ornl.gov/grailexp/gxpfaq1.html
Write a Comment
User Comments (0)
About PowerShow.com