Gene Prediction - PowerPoint PPT Presentation

Provided by: patt86

Transcript and Presenter's Notes

Title: Gene Prediction


1
Chapter 8 Gene Prediction
2
  • Automated sequencing of genomes requires automated
    gene assignment
  • Includes detection of open reading frames (ORFs)
  • Identification of the introns and exons
  • Gene prediction is a very difficult problem in
    pattern recognition
  • Coding regions generally do not have conserved
    sequences
  • Much progress has been made with prokaryotic gene
    prediction
  • Eukaryotic genes are more difficult to predict
    correctly

3
  • Ab initio methods
  • Predict genes on given sequence alone
  • Uses gene signals
  • Start/stop codon
  • Intron splice sites
  • Transcription factor binding sites and ribosomal
    binding sites
  • Poly-A sites
  • Codon structure demands a multiple of three nucleotides
  • Gene content
  • Nucleotide composition, modeled with HMMs
  • Homology based methods
  • Matches to known genes
  • Matches to cDNA
  • Consensus based
  • Uses output from more than one program

4
  • Prokaryotic gene structure
  • ATG (GTG or TTG less frequent) is start codon
  • Ribosome binding site (Shine-Dalgarno sequence)
    complementary to 16S rRNA of ribosome
  • AGGAGGT
  • Stop codon (TAA, TAG or TGA)
  • Transcription termination site (ρ-independent
    termination)
  • Stem-loop secondary structure followed by string
    of Ts

5
  • Translate sequence in all 6 reading frames
  • In random sequence a stop codon occurs about once
    every 20 codons
  • Look for frames longer than 30 codons (normally
    50-60 codons)
  • Check for presence of a start codon and
    Shine-Dalgarno sequence
  • Translate putative ORF into protein, and search
    databases
  • Non-randomness of 3rd base of codon, more
    frequently G/C
  • Plotting wobble base GC can identify ORFs
  • 3rd base also repeats, thus repetition gives a clue
    to gene location
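
The first steps on this slide (six reading frames, a minimum frame length, a start codon) can be sketched in a few lines of Python. This is only an illustrative sketch: the function names and the choice to report coordinates on the scanned strand are assumptions, and the Shine-Dalgarno check and database search are left out.

```python
# Minimal six-frame ORF scan. The 30-codon default follows the slide;
# function names and coordinate conventions are illustrative assumptions.
STOPS = {"TAA", "TAG", "TGA"}
STARTS = {"ATG", "GTG", "TTG"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_frame(seq, frame, min_codons):
    """Yield (start, end) of start-to-stop ORFs in one frame of seq."""
    start = None
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon in STARTS:
            start = i  # first start codon after the previous stop
        elif start is not None and codon in STOPS:
            if (i - start) // 3 >= min_codons:
                yield start, i + 3  # end index includes the stop codon
            start = None


def find_orfs(seq, min_codons=30):
    """Return (strand, frame, start, end) tuples for all six frames.

    Coordinates for the '-' strand refer to the reverse-complemented
    sequence, not to the original one.
    """
    seq = seq.upper()
    results = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            for start, end in orfs_in_frame(s, frame, min_codons):
                results.append((strand, frame, start, end))
    return results
```

Raising `min_codons` towards the 50-60 codons mentioned above trades sensitivity for fewer spurious frames.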

6
  • Markov chains and HMMs
  • In a kth-order model, each position depends on the
    k previous positions
  • The higher the order of a Markov model used to
    describe a gene, the more non-randomness the
    model captures
  • Genes described in codons or hexamers
  • HMMs trained with known genes
  • Codon pairs are often found, thus 6-nucleotide
    patterns often occur in ORFs; a 5th-order Markov
    chain captures them
  • 5th-order HMM gives very accurate gene
    predictions
  • Problem may be that in short genes there are not
    enough hexamers
  • Interpolated Markov Model (IMM) samples Markov
    chains of different lengths
  • Weighting scheme places less weight on rare k-mers
  • Final probability is the combined probability of
    all weighted k-mers
  • Typical and atypical genes
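
As a rough illustration of how a kth-order Markov chain scores a sequence, the sketch below trains one chain on known coding sequences and one on background sequence, then compares their log-probabilities. The class name, the add-one smoothing, and the log-odds comparison are assumptions made for the example; this is a plain fixed-order chain, not an HMM or an IMM, and not the method of any particular program.

```python
import math
from collections import defaultdict


class MarkovChain:
    """Fixed k-th order Markov chain over A/C/G/T with add-one smoothing."""

    def __init__(self, k):
        self.k = k
        # counts[context][next_base] = number of times seen in training
        self.counts = defaultdict(lambda: defaultdict(int))

    def train(self, sequences):
        for seq in sequences:
            for i in range(self.k, len(seq)):
                self.counts[seq[i - self.k:i]][seq[i]] += 1

    def log_prob(self, seq):
        """Log-probability of seq, conditioned on its first k bases."""
        total = 0.0
        for i in range(self.k, len(seq)):
            ctx = self.counts.get(seq[i - self.k:i], {})
            n = sum(ctx.values())
            # add-one smoothing over the four bases
            total += math.log((ctx.get(seq[i], 0) + 1) / (n + 4))
        return total


def coding_log_odds(seq, coding, background):
    """Positive values mean seq looks more like the coding training set."""
    return coding.log_prob(seq) - background.log_prob(seq)
```

With `k=5` each base is predicted from the preceding five, so the model is effectively counting hexamers. The short-gene problem noted above shows up here as contexts with few or no counts, which is what interpolation across several orders is meant to soften.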

7
  • GeneMark (http://exon.gatech.edu/genemark/)
  • Trained on complete microbial genomes
  • Most closely related organism used for predictions
  • Glimmer (Gene Locator and Interpolated Markov
    Modeler) (http://www.cbcb.umd.edu/software/glimmer/)
  • FGENESB (http://linux1.softberry.com/)
  • 5th-order HMM
  • Trained with bacterial sequences
  • Linear discriminant analysis (LDA)
  • RBSFinder (ftp://ftp.tigr.org)
  • Takes output from Glimmer and searches for S-D
    sequences close to start sites
8
(No Transcript)
9
  • Performance evaluation
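  • Sensitivity: $Sn = \frac{TP}{TP + FN}$
  • Specificity: $Sp = \frac{TP}{TP + FP}$
  • Correlation coefficient: $CC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TN + FN)(TP + FN)(TN + FP)}}$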
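
The three measures on this slide can be computed directly from the counts of true and false positives and negatives. The sketch below uses the standard form of the correlation coefficient; the function name is an assumption, and no guard is included for the degenerate case where a denominator is zero.

```python
import math


def prediction_accuracy(tp, tn, fp, fn):
    """Return (Sn, Sp, CC) from counts of TP, TN, FP and FN.

    Sp here is TP / (TP + FP), as defined on the slide.
    """
    sn = tp / (tp + fn)
    sp = tp / (tp + fp)
    denom = math.sqrt((tp + fp) * (tn + fn) * (tp + fn) * (tn + fp))
    cc = (tp * tn - fp * fn) / denom
    return sn, sp, cc
```

For example, `prediction_accuracy(80, 900, 20, 20)` gives Sn = 0.8 and Sp = 0.8, with CC a little below both because it also accounts for the true negatives.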

10
Gene prediction in eukaryotes
  • Low gene density (about 3% in humans)
  • Space between genes very large, with multiply
    repeated sequences and transposable elements
  • Eukaryotic genes are split (introns/exons)
  • Transcript is capped (methylation of 5′ residue)
  • Splicing in spliceosome
  • Alternative splicing
  • Polyadenylation (250 As added) downstream of
    CAATAAA(T/C) consensus box
  • Major issue: identification of splice sites
  • GT-AG rule (GTAAGT / Y12NCAG at the 5′/3′ intron
    splice junctions)
  • Codon use frequencies
  • ATG start codon
  • Kozak sequence (CCGCCATGG)
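
To see why splice site identification is the major issue, it helps to apply the GT-AG rule naively. The sketch below lists every GT...AG span within a length range as a candidate intron; the function name and the length limits are assumptions chosen for illustration, and real predictors score the longer consensus around each junction rather than the bare dinucleotides.

```python
def candidate_introns(seq, min_len=60, max_len=10000):
    """List (start, end) spans that begin with GT and end with AG.

    min_len and max_len are illustrative limits, not values from the slide.
    seq[start:end] is the candidate intron, including both dinucleotides.
    """
    seq = seq.upper()
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]
    acceptors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "AG"]
    introns = []
    for d in donors:
        for a in acceptors:
            length = a + 2 - d
            if min_len <= length <= max_len:
                introns.append((d, a + 2))
    return introns
```

On genomic sequence of any length this returns a very large number of candidates, almost all false, which is why the dinucleotide rule has to be combined with codon usage and other signals.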
11
  • Ab initio programs
  • Gene signals
  • Start/stop
  • Putative splice signals
  • Consensus sequences
  • Poly-A sites
  • Gene content
  • Coding statistics
  • Non-random nucleotide distributions
  • Hexamer frequencies
  • HMMs

12
  • Discriminant analysis
  • Plot 2D graph of coding length versus 3′ splice
    site
  • Place diagonal line (LDA) that separates true
    coding from non-coding sequences based on learned
    knowledge
  • QDA fits quadratic curve
  • FGENES uses LDA
  • MZEF (Michael Zhang's Exon Finder) uses QDA
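
The idea of placing a separating line in a 2D feature plot can be illustrated with a two-class Fisher linear discriminant. The sketch below is written in plain Python for two features per candidate; the feature pairing and all function names are assumptions, and it is not the procedure used by FGENES.

```python
def mean(points):
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def scatter(points, m):
    """Within-class scatter of 2D points as (sxx, sxy, syy)."""
    sxx = sum((p[0] - m[0]) ** 2 for p in points)
    sxy = sum((p[0] - m[0]) * (p[1] - m[1]) for p in points)
    syy = sum((p[1] - m[1]) ** 2 for p in points)
    return sxx, sxy, syy


def fit_lda(coding, noncoding):
    """Return (w, threshold); x is called coding when w.x > threshold."""
    mc, mn = mean(coding), mean(noncoding)
    a, b = scatter(coding, mc), scatter(noncoding, mn)
    sxx, sxy, syy = a[0] + b[0], a[1] + b[1], a[2] + b[2]
    det = sxx * syy - sxy * sxy
    dx, dy = mc[0] - mn[0], mc[1] - mn[1]
    # w = Sw^-1 (mc - mn), using the closed-form inverse of a 2x2 matrix
    w = ((syy * dx - sxy * dy) / det, (sxx * dy - sxy * dx) / det)
    # place the boundary through the midpoint of the two class means
    mid = ((mc[0] + mn[0]) / 2, (mc[1] + mn[1]) / 2)
    threshold = w[0] * mid[0] + w[1] * mid[1]
    return w, threshold


def is_coding(x, w, threshold):
    return w[0] * x[0] + w[1] * x[1] > threshold
```

Training points would be pairs such as (length, splice site score) taken from known exons and known non-coding regions. QDA differs in giving each class its own covariance, which turns the straight boundary into a quadratic curve.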

13
  • Neural Nets
  • A series of input, hidden and output layers
  • Gene structure information is fed to input layer,
    and is separated into several classes
  • Hexamer frequencies
  • splice sites
  • GC composition
  • Weights are calculated in the hidden layer to
    generate the exon output
  • When the input layer is challenged with a new
    sequence, the rules that were generated to output
    exons are applied to the new sequence

14
  • HMMs
  • GENSCAN (http://genes.mit.edu/GENSCAN.html):
    5th-order HMM
  • Combines hexamer frequencies with coding signals
  • Initiation codons
  • TATA boxes
  • CAP site
  • Poly-A
  • Trained on vertebrate, Arabidopsis and maize data
  • Extensively used in human genome project
  • HMMgene (http://www.cbs.dtu.dk/services/HMMgene)
  • Identifies subregions of exons from cDNA or
    protein matches
  • Locks such regions and uses HMM extension into
    neighboring regions

15
(No Transcript)
16
(No Transcript)
17
  • Homology based programs
  • Use translations to search for ESTs, cDNA and
    proteins in databases
  • GenomeScan (http://genes.mit.edu/genomescan.html)
  • Combines GENSCAN with BLASTX
  • EST2Genome (http://bioweb.pasteur.fr/seqanal/interfaces/est2genome.html)
  • Compares EST and cDNA to user sequence
  • TwinScan
  • Similar to GenomeScan

18
(No Transcript)
19
  • Consensus-based programs
  • Use several different programs to generate lists
    of predicted exons
  • Only exons predicted in common by the programs
    are retained
  • GeneComber (http://www.bioinformatics.ubc.ca/gencombver/index.php)
  • Combines HMMgene with GENSCAN
  • DIGIT (http://digit.gsc.riken.go.jp/cgi-bin/index.cgi)
  • Combines FGENESH, GENSCAN and HMMgene
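
The retention rule described on this slide amounts to a set intersection over the exon lists. The sketch below assumes each program's output has already been parsed into (start, end) pairs and that two predictions count as the same exon only when both coordinates match exactly; the actual tools may combine predictions by other rules.

```python
def consensus_exons(*predictions):
    """Keep only the exons (start, end) reported by every program.

    Each argument is one program's list of predicted exons. Exact
    coordinate agreement is an assumption made for this sketch.
    """
    sets = [set(p) for p in predictions]
    return sorted(set.intersection(*sets)) if sets else []
```

Requiring agreement lowers the false-positive rate at the cost of missing exons that only one program finds.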

20
Accuracy (ME = missed exons, WE = wrong exons)
           Nucleotide level    Exon level
Program    Sn    Sp    CC      Sn    Sp    (Sn+Sp)/2  ME    WE
FGENES     0.86  0.88  0.83    0.67  0.67  0.67       0.12  0.09
GeneMark   0.87  0.89  0.83    0.53  0.54  0.54       0.13  0.11
Genie      0.91  0.90  0.88    0.71  0.70  0.71       0.19  0.11
GENSCAN    0.95  0.90  0.91    0.71  0.70  0.70       0.08  0.09
HMMgene    0.93  0.93  0.91    0.76  0.77  0.76       0.12  0.07
Morgan     0.75  0.74  0.74    0.46  0.41  0.43       0.20  0.28
MZEF       0.70  0.73  0.66    0.58  0.59  0.59       0.32  0.23
21
Chapter 9 Promoter and regulatory element
prediction
22
  • Promoters are short regions upstream of the
    transcription start site
  • Contain short (6-8 nt) transcription factor
    recognition sites
  • Extremely laborious to define by experiment
  • Sequence is not translated into protein, so no
    homology matching is possible
  • Each promoter is unique, with a unique combination
    of factor binding sites, thus there is no
    consensus promoter

23
Prokaryotic gene (diagram labels: TF site, TF, polymerase, -35 box, -10 box, ORF)
  • σ70 factor binds to -35 and -10 boxes and recruits
    the full polymerase enzyme
  • -35 box consensus sequence TTGACA
  • -10 box consensus sequence TATAAT
  • Transcription factors that activate or repress
    transcription
  • Bind to regulatory elements
  • DNA loops to allow long-distance interactions
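
A simple consensus search for the two boxes can be sketched as follows. The consensus sequences are the ones given on this slide; the allowed spacer of 15 to 19 bp between the boxes, the one-mismatch tolerance, and the function names are assumptions added for the example.

```python
def mismatches(a, b):
    """Count positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_sigma70_promoters(seq, max_mismatch=1, spacers=range(15, 20)):
    """Return (pos_35, pos_10) index pairs of candidate promoters.

    The spacer range and mismatch tolerance are illustrative assumptions.
    """
    seq = seq.upper()
    box35, box10 = "TTGACA", "TATAAT"
    hits = []
    for i in range(len(seq) - 5):
        if mismatches(seq[i:i + 6], box35) > max_mismatch:
            continue
        for gap in spacers:
            j = i + 6 + gap  # start of the -10 box for this spacer length
            if j + 6 <= len(seq) and mismatches(seq[j:j + 6], box10) <= max_mismatch:
                hits.append((i, j))
    return hits
```

Loosening `max_mismatch` finds weaker promoters but quickly increases the number of chance matches, since both boxes are only six bases long.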

24
Eukaryotic gene structure (diagram labels: TF site, Pol II, TF site, TATA, Inr)
  • Polymerase I, II and III
  • Basal transcription factors (TFIID, TFIIA, TFIIB,
    etc.)
  • TATA box: TATA(A/T)A(A/T)
  • Housekeeping genes often do not contain TATA boxes
  • Initiator site (Inr): (C/T)(C/T)CA(C/T)(C/T),
    coincides with transcription start
  • Many TF sites
  • Activation/repression
25
  • Ab initio methods
  • Promoter signals
  • TATA boxes
  • Hexamer frequencies
  • Consensus sequence matching
  • PSSM
  • Numerous FPs
  • HMMs incorporate neighboring information
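
A PSSM replaces exact consensus matching with a per-position score. The sketch below builds a log-odds matrix from a set of aligned sites and slides it along a sequence; the pseudocount, the uniform background of 0.25, and the function names are assumptions, and the input is expected to contain only A, C, G and T.

```python
import math


def build_pssm(sites, pseudocount=1.0, background=0.25):
    """Build a log2-odds PSSM from equal-length aligned sites."""
    total = len(sites) + 4 * pseudocount
    pssm = []
    for pos in range(len(sites[0])):
        column = [s[pos] for s in sites]
        pssm.append({
            b: math.log2(((column.count(b) + pseudocount) / total) / background)
            for b in "ACGT"
        })
    return pssm


def score(pssm, window):
    """Sum the per-position log-odds for one window."""
    return sum(col[base] for col, base in zip(pssm, window))


def scan(pssm, seq, threshold):
    """Return (position, score) for every window scoring >= threshold."""
    n = len(pssm)
    hits = []
    for i in range(len(seq) - n + 1):
        s = score(pssm, seq[i:i + n])
        if s >= threshold:
            hits.append((i, s))
    return hits
```

The threshold controls the trade-off noted above: short motifs score well by chance, so a low threshold returns numerous false positives.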

26
  • Promoter prediction in prokaryotes
  • Find operon
  • Upstream of first gene is promoter
  • Wang rules (distance between genes, no
    ρ-independent termination, number of genomes that
    display linkage)
  • BPROM (http://www.softberry.com)
  • Based on arbitrary setting of operon gene
    distances
  • 200 bp upstream of first gene
  • Many FPs
  • FindTerm (http://sun1.softberry.com)
  • Searches for ρ-independent termination signals

27
Prediction in eukaryotes
  • Searching for consensus sequences in databases
    (TRANSFAC)
  • Increase specificity by searching for CpG
    islands
  • High density of transcription factor binding
    sites
  • CpGProD (http://pbil.univ-lyon1.fr/software/cpgprod.html)
  • CG content in moving window
  • Eponine (http://servlet.sanger.ac.uk:8080/eponine/)
  • Matches TATA box, CCAAT box, CpG island to PSSM
  • Cluster-Buster (http://zlab.bu.edu/cluster-buster/cbust.html)
  • Detects high concentrations of TF sites
  • FirstEF (http://rulai.cshl.org/tools/FirstEF/)
  • QDA of first exon boundary
  • McPromoter (http://genes.mit.edu/McPromoter.html)
  • Neural net of DNA bendability, TATA box, initiator
    box
  • Trained for Drosophila and human sequences
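
A moving-window CpG scan of the kind mentioned for CpG island detection can be sketched as below. The window size and the cutoffs (GC fraction of at least 0.5, observed/expected CpG ratio of at least 0.6) are commonly quoted criteria assumed here for illustration; they are not taken from the slide and this is not a description of how CpGProD itself works.

```python
def cpg_stats(window):
    """Return (GC fraction, observed/expected CpG ratio) for one window."""
    n = len(window)
    c, g = window.count("C"), window.count("G")
    cpg = window.count("CG")
    gc_fraction = (c + g) / n
    # expected CpG count under independence is c * g / n
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return gc_fraction, obs_exp


def cpg_windows(seq, size=200, step=1, min_gc=0.5, min_obs_exp=0.6):
    """Return start positions of windows that pass both cutoffs."""
    seq = seq.upper()
    hits = []
    for start in range(0, len(seq) - size + 1, step):
        gc, oe = cpg_stats(seq[start:start + size])
        if gc >= min_gc and oe >= min_obs_exp:
            hits.append(start)
    return hits
```

Adjacent passing windows would normally be merged into a single island before being reported as a candidate promoter region.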

28
Phylogenetic footprinting technique
  • Identify conserved regulatory sites
  • Human-chimpanzee too close
  • Human-fish too distant
  • Human-mouse appropriate
  • ConSite (http://mordor.cgb.ki.se/cgi-bin/CONSITE/consite)
  • Aligns two sequences by global alignment
    algorithm
  • Identifies conserved regions and compares them to
    TRANSFAC database
  • High scoring hits returned as positives
  • rVISTA (http://rvista.dcode.org)
  • Identifies TRANSFAC sites in two orthologous
    sequences
  • Aligns sequences with local alignment algorithm
  • Highest identity regions returned as hits
  • Bayes aligner (http://www.bioinfo.rpi.edu/applications/bayesian/bayes/bayes.align12.pl)
  • Aligns two sequences with Bayesian algorithm
  • Even weakly conserved regions identified

29
Expression-profiling based method
  • Microarray analysis allows identification of
    co-regulated genes
  • Assume that promoters contain similar regulatory
    sites
  • Find such sites by EM and Gibbs sampling using
    iteration of PSSM
  • Co-expressed genes may be regulated at higher
    levels
  • MEME (http://meme.sdsc.edu/meme/website/meme-intro.html)
  • AlignACE (http://atlas.med.harvard.edu/cgi-bin/alignace.pl)
  • Gibbs sampling algorithm
30
Web humour