Gene Prediction - PowerPoint PPT Presentation

Provided by: patt86

1
Chapter 8 Gene Prediction
2
  • Automated sequencing of genomes requires automated
    gene assignment
  • Includes detection of open reading frames (ORFs)
  • Identification of the introns and exons
  • Gene prediction is a very difficult problem in
    pattern recognition
  • Coding regions generally do not have conserved
    sequences
  • Much progress has been made with prokaryotic gene
    prediction
  • Eukaryotic genes are more difficult to predict
    correctly

3
  • Ab initio methods
    • Predict genes from the given sequence alone
    • Use gene signals
      • Start/stop codons
      • Intron splice sites
      • Transcription factor binding sites and ribosomal
        binding sites
      • Poly-A sites
      • Codon structure demands a length that is a
        multiple of three nucleotides
    • Use gene content
      • Nucleotide composition, modeled with HMMs
  • Homology-based methods
    • Matches to known genes
    • Matches to cDNA
  • Consensus-based methods
    • Use output from more than one program

4
  • Prokaryotic gene structure
  • ATG is the start codon (GTG or TTG less frequently)
  • Ribosome binding site (Shine-Dalgarno sequence)
    complementary to the 16S rRNA of the ribosome
    • AGGAGGT
  • Stop codon (TAA, TAG or TGA)
  • Transcription termination site (ρ-independent
    termination)
    • Stem-loop secondary structure followed by a string
      of Ts

5
  • Translate the sequence in all 6 reading frames
  • In random sequence a stop codon occurs about once
    every 20 codons
  • Look for frames longer than 30 codons (normally
    50-60 codons)
  • Check for the presence of a start codon and a
    Shine-Dalgarno sequence
  • Translate the putative ORF into protein, and search
    databases
  • Non-randomness of the 3rd base of codons, which is
    more frequently G/C
  • Plotting wobble-base GC content can identify ORFs
  • The 3rd base also tends to repeat, so this repetition
    gives a clue to gene location
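
The ORF scan described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular program: the function names (find_orfs, gc3) are hypothetical, the 30-codon cutoff is the one quoted above, and the start codons are the prokaryotic ATG/GTG/TTG set. It does not check for a Shine-Dalgarno sequence.

```python
# Minimal six-frame ORF scan; all names here are illustrative assumptions.
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STARTS = {"ATG", "GTG", "TTG"}
STOPS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=30):
    """Return (strand, frame, start, end, n_codons) for each ORF found.

    Coordinates are 0-based on the scanned strand; end is exclusive and
    includes the stop codon. n_codons counts the start codon through the
    last sense codon.
    """
    orfs = []
    seq = seq.upper()
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon in STARTS:
                    start = i  # first start codon opens the ORF
                elif start is not None and codon in STOPS:
                    n_codons = (i - start) // 3
                    if n_codons >= min_codons:
                        orfs.append((strand, frame, start, i + 3, n_codons))
                    start = None  # look for the next ORF in this frame
    return orfs


def gc3(orf_seq):
    """Fraction of G or C at the third (wobble) position of each codon."""
    thirds = orf_seq[2::3]
    return sum(b in "GC" for b in thirds) / len(thirds) if thirds else 0.0
```

Calling find_orfs on a sequence returns candidates on both strands; gc3 on each candidate gives the wobble-base GC fraction that the slide suggests plotting.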

6
  • Markov chains and HMMs
  • In a kth-order model, each position depends on the k
    previous positions
  • The higher the order of a Markov model used to
    describe a gene, the more non-randomness the
    model includes
  • Genes are described in codons or hexamers
  • HMMs are trained with known genes
  • Codon pairs are often found, so 6-nucleotide
    patterns often occur in ORFs; these are captured by a
    5th-order Markov chain
  • A 5th-order HMM gives very accurate gene
    predictions
  • A problem may be that short genes do not contain
    enough hexamers
  • An Interpolated Markov Model (IMM) samples Markov
    chains of different lengths
    • The weighting scheme places less weight on rare k-mers
    • The final probability is the combined probability of
      all weighted k-mers
  • Typical and atypical genes
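
To make the 5th-order idea concrete, the sketch below trains a fixed-order Markov chain on example sequences and scores a query by its log-odds against a background model. It is a simplified illustration under stated assumptions: the function names and the add-one pseudocount are hypothetical choices, and this is a plain fixed-order chain, not the interpolated model used by Glimmer.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


def train_markov(seqs, k=5, pseudocount=1.0):
    """Estimate P(next base | previous k bases) from training sequences.

    Returns a function prob(context, base). Unseen contexts fall back to
    a uniform distribution through the pseudocount (an assumption here).
    """
    counts = defaultdict(lambda: defaultdict(float))
    for s in seqs:
        for i in range(k, len(s)):
            counts[s[i - k:i]][s[i]] += 1.0

    def prob(context, base):
        ctx = counts.get(context, {})
        total = sum(ctx.values()) + pseudocount * len(ALPHABET)
        return (ctx.get(base, 0.0) + pseudocount) / total

    return prob


def log_odds(seq, coding, background, k=5):
    """Sum of log2 probability ratios; positive favours the coding model."""
    return sum(
        math.log2(coding(seq[i - k:i], seq[i]) / background(seq[i - k:i], seq[i]))
        for i in range(k, len(seq))
    )
```

With k=5 each term is effectively a hexamer statistic, which is why short genes with few hexamers are hard to score reliably.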

7
  • GeneMark (http://exon.gatech.edu/genemark/)
    • Trained on complete microbial genomes
    • Most closely related organism used for predictions
  • Glimmer (Gene Locator and Interpolated Markov
    Modeler) (http://www.cbcb.umd.edu/software/glimmer/)
  • FGENESB (http://linux1.softberry.com/)
    • 5th-order HMM
    • Trained with bacterial sequences
    • Linear discriminant analysis (LDA)
  • RBSFinder (ftp://ftp.tigr.org)
    • Takes output from Glimmer and searches for S-D
      sequences close to start sites
9
  • Performance evaluation
  • Sensitivity: $S_n = TP/(TP+FN)$
  • Specificity: $S_p = TP/(TP+FP)$
  • Correlation coefficient: $CC = (TP \cdot TN - FP \cdot FN)/\sqrt{(TP+FP)(TN+FN)(TP+FN)(TN+FP)}$
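
As a worked example, the three measures can be computed directly from the four counts. The function name evaluate and the example numbers are hypothetical; TP, FP, TN and FN are true positives, false positives, true negatives and false negatives. Note that "specificity" here follows the slide's gene-prediction definition, TP/(TP+FP).

```python
import math


def evaluate(tp, fp, tn, fn):
    """Return (sensitivity, specificity, correlation coefficient).

    Assumes tp + fn > 0 and tp + fp > 0; CC is reported as 0.0 when its
    denominator is zero (an arbitrary convention chosen for this sketch).
    """
    sn = tp / (tp + fn)
    sp = tp / (tp + fp)
    denom = math.sqrt((tp + fp) * (tn + fn) * (tp + fn) * (tn + fp))
    cc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sn, sp, cc


# Hypothetical counts: 80 coding nucleotides found, 20 missed,
# 20 false calls, 880 non-coding nucleotides correctly rejected.
sn, sp, cc = evaluate(tp=80, fp=20, tn=880, fn=20)
```

For these counts Sn and Sp are both 0.8 and CC is 70000/90000, roughly 0.78.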

10
Gene prediction in eukaryotes
  • Low gene density (3% in humans)
  • Space between genes is very large, with multiply
    repeated sequences and transposable elements
  • Eukaryotic genes are split (introns/exons)
  • Transcript is capped (methylation of the 5′ residue)
  • Splicing in the spliceosome
  • Alternative splicing
  • Polyadenylation (250 As added) downstream of the
    CAATAAA(T/C) consensus box
  • Major issue: identification of splicing sites
    • GT-AG rule (GTAAGT / Y12NCAG at the 5′/3′ intron
      splice junctions)
  • Codon use frequencies
  • ATG start codon
    • Kozak sequence (CCGCCATGG)
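
The GT-AG rule can be illustrated with a naive consensus scan. This is only a sketch: it demands an exact match to the GTAAGT and Y12NCAG consensus strings quoted on the slide, whereas real splice sites vary and real predictors score them with weighted models. The function name and the min_len value are assumptions made for the example.

```python
import re

# Lookaheads so that overlapping matches are all reported.
DONOR = re.compile(r"(?=GTAAGT)")                 # 5' splice site consensus
ACCEPTOR = re.compile(r"(?=[CT]{12}[ACGT]CAG)")   # 3' site: Y12 N C AG


def candidate_introns(seq, min_len=60):
    """Pair each consensus donor with every downstream consensus acceptor.

    Returns (start, end) pairs, 0-based with end exclusive, so that
    seq[start:end] begins with GT and ends with AG.
    """
    seq = seq.upper()
    donors = [m.start() for m in DONOR.finditer(seq)]
    # The intron ends right after the AG: 12 (Y) + 1 (N) + 3 (CAG) = 16.
    ends = [m.start() + 16 for m in ACCEPTOR.finditer(seq)]
    return [(d, e) for d in donors for e in ends if e - d >= min_len]
```

Every donor is paired with every acceptor far enough downstream, which shows why splice-site identification alone produces many false candidates.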
11
  • Ab initio programs
  • Gene signals
  • Start/stop
  • Putative splice signals
  • Consensus sequences
  • Poly-A sites
  • Gene content
  • Coding statistics
  • Non-random nucleotide distributions
  • Hexamer frequencies
  • HMMs

12
  • Discriminant analysis
  • Plot a 2D graph of coding length versus 3′ splice
    site
  • Place a diagonal line (LDA) that separates true
    coding from non-coding sequences based on learned
    knowledge
  • QDA fits a quadratic curve
  • FGENES uses LDA
  • MZEF (Michael Zhang's Exon Finder) uses QDA
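
A two-feature linear discriminant of the kind described here can be sketched with Fisher's method. Everything specific below is an assumption made for illustration: the feature pair (coding length, a 3′ splice site score between 0 and 1), the training points, and the function names. It is not the procedure used by FGENES.

```python
def fit_lda(pos, neg):
    """Fisher linear discriminant for two classes of 2-D points.

    Returns (w, threshold); a point x is called coding when
    w[0]*x[0] + w[1]*x[1] > threshold.
    """
    def mean(pts):
        n = len(pts)
        return (sum(p[0] for p in pts) / n, sum(p[1] for p in pts) / n)

    def scatter(pts, m):
        sxx = sum((p[0] - m[0]) ** 2 for p in pts)
        sxy = sum((p[0] - m[0]) * (p[1] - m[1]) for p in pts)
        syy = sum((p[1] - m[1]) ** 2 for p in pts)
        return sxx, sxy, syy

    m1, m0 = mean(pos), mean(neg)
    # Pooled within-class scatter matrix [[sxx, sxy], [sxy, syy]].
    sxx, sxy, syy = (u + v for u, v in zip(scatter(pos, m1), scatter(neg, m0)))
    det = sxx * syy - sxy * sxy
    dx, dy = m1[0] - m0[0], m1[1] - m0[1]
    # w = Sw^-1 (m1 - m0), using the closed-form inverse of a 2x2 matrix.
    w = ((syy * dx - sxy * dy) / det, (sxx * dy - sxy * dx) / det)
    mid = ((m1[0] + m0[0]) / 2, (m1[1] + m0[1]) / 2)
    return w, w[0] * mid[0] + w[1] * mid[1]


def is_coding(x, w, threshold):
    """Classify one (length, splice_score) point with the fitted line."""
    return w[0] * x[0] + w[1] * x[1] > threshold


# Hypothetical training points: (coding length, 3' splice site score).
pos = [(120, 0.9), (150, 0.8), (200, 0.7), (180, 0.95)]
neg = [(30, 0.2), (50, 0.4), (40, 0.1), (60, 0.3)]
w, t = fit_lda(pos, neg)
```

The line w·x = t is the "diagonal line" on the 2D plot; QDA would replace it with a quadratic boundary by estimating a separate covariance for each class.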

13
  • Neural nets
  • A series of input, hidden and output layers
  • Gene structure information is fed to the input layer,
    and is separated into several classes
    • Hexamer frequencies
    • Splice sites
    • GC composition
  • Weights are calculated in the hidden layer to
    generate the exon output
  • When the input layer is challenged with a new
    sequence, the rules that were generated to output
    exons are applied to the new sequence

14
  • HMMs
  • GenScan (http://genes.mit.edu/GENSCAN.html)
    • 5th-order HMM
    • Combines hexamer frequencies with coding signals
      • Initiation codons
      • TATA boxes
      • CAP site
      • Poly-A
    • Trained on vertebrate, Arabidopsis and maize data
    • Extensively used in the human genome project
  • HMMgene (http://www.cbs.dtu.dk/services/HMMgene)
    • Identifies subregions of exons from cDNA or
      proteins
    • Locks such regions and uses HMM extension into
      neighboring regions

17
  • Homology-based programs
  • Use translations to search for ESTs, cDNAs and
    proteins in databases
  • GenomeScan (http://genes.mit.edu/genomescan.html)
    • Combines GENSCAN with BLASTX
  • EST2Genome (http://bioweb.pasteur.fr/seqanal/interfaces/est2genome.html)
    • Compares EST and cDNA to the user sequence
  • TwinScan
    • Similar to GenomeScan

19
  • Consensus-based programs
  • Use several different programs to generate lists
    of predicted exons
  • Only commonly predicted exons are retained
  • GeneComber (http://www.bioinformatics.ubc.ca/gencombver/index.php)
    • Combines HMMgene with GenScan
  • DIGIT (http://digit.gsc.riken.go.jp/cgi-bin/index.cgi)
    • Combines FGENESH, GENSCAN and HMMgene

20
Accuracy
21
Chapter 9 Promoter and regulatory element
prediction
22
  • Promoters are short regions upstream of the
    transcription start site
  • They contain short (6-8 nt) transcription factor
    recognition sites
  • Extremely laborious to define by experiment
  • The sequence is not translated into protein, so no
    homology matching is possible
  • Each promoter is unique, with a unique combination
    of factor binding sites, thus there is no consensus
    promoter

23
Prokaryotic gene
[Figure: prokaryotic gene showing the TF site, TF, polymerase, -35 box, -10 box and ORF]
  • The σ70 factor binds to the -35 and -10 boxes and
    recruits the full polymerase enzyme
    • -35 box consensus sequence: TTGACA
    • -10 box consensus sequence: TATAAT
  • Transcription factors activate or repress
    transcription
    • Bind to regulatory elements
    • DNA loops to allow long-distance interactions
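
The two consensus boxes lend themselves to a simple mismatch-tolerant scan. The sketch below is illustrative only: the function names, the limit of 2 mismatches per box, and the 15-19 bp spacer range between the boxes are assumptions not stated on the slide.

```python
MINUS35 = "TTGACA"
MINUS10 = "TATAAT"


def mismatches(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_sigma70_promoters(seq, max_mismatch=2, spacers=range(15, 20)):
    """Return (pos35, pos10, total_mismatches) for candidate box pairs.

    Positions are 0-based starts of each hexamer. The spacer range and
    mismatch limit are assumed values for this sketch.
    """
    seq = seq.upper()
    hits = []
    for i in range(len(seq) - 5):
        m35 = mismatches(seq[i:i + 6], MINUS35)
        if m35 > max_mismatch:
            continue
        for gap in spacers:
            j = i + 6 + gap
            if j + 6 > len(seq):
                break  # larger gaps would run past the end as well
            m10 = mismatches(seq[j:j + 6], MINUS10)
            if m10 <= max_mismatch:
                hits.append((i, j, m35 + m10))
    # Best (fewest mismatches) candidates first.
    return sorted(hits, key=lambda h: h[2])
```

Loosening max_mismatch quickly increases the number of hits, which is one reason consensus matching produces many false positives.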

24
Eukaryotic gene structure
[Figure: eukaryotic gene showing TF sites, Pol II, the TATA box and Inr]
  • Polymerase I, II and III
  • Basal transcription factors (TFIID, TFIIA, TFIIB, etc.)
  • TATA box: TATA(A/T)A(A/T)
  • Housekeeping genes often do not contain TATA boxes
  • Initiator site (Inr): (C/T)(C/T)CA(C/T)(C/T),
    coincides with the transcription start
  • Many TF sites
  • Activation/repression
25
  • Ab initio methods
  • Promoter signals
  • TATA boxes
  • Hexamer frequencies
  • Consensus sequence matching
  • PSSM
  • Numerous FPs
  • HMMs incorporate neighboring information

26
  • Promoter prediction in prokaryotes
  • Find the operon
    • Upstream of the first gene is the promoter
    • Wang rules (distance between genes, no
      ρ-independent termination, number of genomes that
      display linkage)
  • BPROM (http://www.softberry.com)
    • Based on an arbitrary setting of operon gene
      distances
    • 200 bp upstream of the first gene
    • Many FPs
  • FindTerm (http://sun1.softberry.com)
    • Searches for ρ-independent termination signals

27
Prediction in eukaryotes
  • Searching for consensus sequences in databases
    (TRANSFAC)
  • Increase specificity by searching for CpG
    islands
  • High density of transcription factor binding
    sites
  • CpGProD (http://pbil.univ-lyon1.fr/software/cpgprod.html)
    • CG content in a moving window
  • Eponine (http://servlet.sanger.ac.uk:8080/eponine/)
    • Matches TATA box, CCAAT box, CpG island to PSSM
  • Cluster-Buster (http://zlab.bu.edu/cluster-buster/cbust.html)
    • Detects high concentrations of TF sites
  • FirstEF (http://rulai.cshl.org/tools/FirstEF/)
    • QDA of the first exon boundary
  • McPromoter (http://genes.mit.edu/McPromoter.html)
    • Neural net of DNA bendability, TATA box, initiator
      box
    • Trained for Drosophila and human sequences
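
The moving-window idea behind CpG island detection can be sketched as below. The thresholds (200 bp window, GC fraction of at least 0.5, observed/expected CpG ratio of at least 0.6) are commonly quoted island criteria and are assumptions here; they are not claimed to be the settings CpGProD uses, and the function name is hypothetical.

```python
def cpg_windows(seq, window=200, step=1, min_gc=0.5, min_oe=0.6):
    """Return start positions of windows that look CpG-island-like."""
    seq = seq.upper()
    hits = []
    for i in range(0, len(seq) - window + 1, step):
        w = seq[i:i + window]
        c, g = w.count("C"), w.count("G")
        cpg = w.count("CG")
        gc = (c + g) / window
        # Observed/expected CpG ratio: CpG count * N / (C count * G count).
        oe = cpg * window / (c * g) if c and g else 0.0
        if gc >= min_gc and oe >= min_oe:
            hits.append(i)
    return hits


# Toy sequence: a 200 bp CG-rich block flanked by AT-rich sequence.
seq = "AT" * 100 + "CG" * 100 + "AT" * 100
starts = cpg_windows(seq, step=50)
```

Overlapping hit windows would normally be merged into a single island; that step is left out to keep the sketch short.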

28
Phylogenetic footprinting technique
  • Identify conserved regulatory sites
    • Human-chimpanzee: too close
    • Human-fish: too distant
    • Human-mouse: appropriate
  • ConSite (http://mordor.cgb.ki.se/cgi-bin/CONSITE/consite)
    • Aligns two sequences with a global alignment
      algorithm
    • Identifies conserved regions and compares them to
      the TRANSFAC database
    • High-scoring hits returned as positives
  • rVISTA (http://rvista.dcode.org)
    • Identifies TRANSFAC sites in two orthologous
      sequences
    • Aligns sequences with a local alignment algorithm
    • Highest-identity regions returned as hits
  • Bayes aligner (http://www.bioinfo.rpi.edu/applications/bayesian/bayes/bayes.align12.pl)
    • Aligns two sequences with a Bayesian algorithm
    • Even weakly conserved regions identified
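
Once two orthologous sequences have been aligned, conserved regions can be located with a sliding identity window. The sketch assumes the alignment is already available as two equal-length gapped strings; the window size, the 80% cutoff and the function name are illustrative assumptions and do not reproduce ConSite or rVISTA.

```python
def conserved_windows(aln_a, aln_b, window=20, min_identity=0.8):
    """Return (start, identity) for alignment windows above the cutoff.

    aln_a and aln_b are rows of a pairwise alignment ('-' marks a gap).
    Gap columns never count as identical.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    hits = []
    for i in range(len(aln_a) - window + 1):
        pairs = zip(aln_a[i:i + window], aln_b[i:i + window])
        same = sum(x == y and x != "-" for x, y in pairs)
        identity = same / window
        if identity >= min_identity:
            hits.append((i, identity))
    return hits


# Toy alignment: a shared 20 bp core flanked by unrelated sequence.
core = "TTGACATATAATGCGCGCTA"
a = "A" * 20 + core + "A" * 20
b = "C" * 20 + core + "G" * 20
hits = conserved_windows(a, b)
```

The conserved windows would then be compared against a binding-site database such as TRANSFAC, a step not shown here.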

29
Expression-profiling based method
  • Microarray analysis allows identification of
    co-regulated genes
  • Assume that promoters contain similar regulatory sites
  • Find such sites by EM and Gibbs sampling, using
    iterative refinement of a PSSM
  • Co-expressed genes may be regulated at higher levels
  • MEME (http://meme.sdsc.edu/meme/website/meme-intro.html)
  • AlignACE (http://atlas.med.harvard.edu/cgi-bin/alignace.pl)
    • Gibbs sampling algorithm
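
The iterative PSSM idea behind Gibbs sampling can be sketched as a basic site sampler: hold one promoter out, build a profile from the current sites in the others, then resample the held-out site in proportion to how well each window fits the profile. This is a minimal sketch under stated assumptions (one site per sequence, fixed motif width, uniform background, hypothetical function name and parameters) and is not the AlignACE implementation. Being stochastic, it may not converge on the true motif in any given run.

```python
import math
import random


def gibbs_motif(seqs, w, iters=200, seed=0, pseudocount=1.0):
    """Return one motif start position per sequence (site sampler sketch).

    Sequences must contain only A, C, G and T and be at least w long.
    """
    rng = random.Random(seed)
    pos = [rng.randrange(len(s) - w + 1) for s in seqs]
    for _ in range(iters):
        for held in range(len(seqs)):
            # Build profile counts from every sequence except the held-out one.
            counts = [{b: pseudocount for b in "ACGT"} for _ in range(w)]
            for k, s in enumerate(seqs):
                if k != held:
                    for j, base in enumerate(s[pos[k]:pos[k] + w]):
                        counts[j][base] += 1
            total = len(seqs) - 1 + 4 * pseudocount
            # Weight each window of the held-out sequence by its
            # likelihood under the current profile.
            s = seqs[held]
            weights = [
                math.prod(counts[j][s[i + j]] / total for j in range(w))
                for i in range(len(s) - w + 1)
            ]
            # Sample the new site position in proportion to those weights.
            pos[held] = rng.choices(range(len(weights)), weights=weights)[0]
    return pos
```

Running the sampler from several seeds and keeping the result with the highest-scoring profile is a common way to reduce the effect of poor starting positions.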
30
Web humour