I. Gene Finding

Provided by: stephe87
1
I. Gene Finding
  • I.1. Introduction
  • I.2. Approaches to (Protein-Coding) Gene Finding
  • I.2.1 Signature of Exons and non-Exons
  • I.2.2 Approaches
  • I.3. Other Topics
  • I.3.1. Alternative Splicing
  • I.3.2. RNA Gene Prediction
  • I.3.3. Other Gene Components Prediction
  • I.3.4. Gene Ontology Prediction
  • Readings: Gene Annotation Prediction and Testing

2
I.1. Introduction: The importance of gene finding
  • Complete sequencing of human genome. SO WHAT??!!!
  • Given the sequence of a genome, we would like to
    be able to identify:
  • Genes
  • Exon boundaries and splice sites
  • Beginning and end of translation
  • Alternative splicings
  • Regulatory elements (e.g. promoters)
  • Subsequently, gene regulatory networks, signaling
    networks, and metabolic networks.
  • The only certain way to do this is experimentally,
    but computational methods can achieve reasonable
    accuracy quickly, and help direct experimental
    approaches.

3
I.1. Introduction: Eukaryotic gene structure
4
I.1 Introduction: NCBI RefSeq Sequences
  • 5 years ago, estimated > 100,000 human genes.
    Now, 30,000 → 25,000
  • NCBI RefSeq
  • RefSeq Code: http://www.ncbi.nlm.nih.gov/RefSeq/key.html#status
  • Only 25,000? Well, isoforms.

[Figure: RefSeq status codes, annotated "????, not very reliable", "Probably ok", "Good"]
5
I.1 Introduction: Gene Prediction
  • There is no (yet known) perfect method for
    finding genes. All approaches rely on combining
    various weak signals together
  • Find elements of a gene:
  • coding sequences (exons): finding splice sites
  • start signals: Translation Initiation Site (TIS),
    Transcription Start Site (TSS)
  • Promoter and other transcription factor binding
    sites
  • poly-A tails and downstream signals
  • Assemble into a consistent gene model

6
I.1 Introduction: General challenges
  • Short exons are hard to find, but not uncommon.
    Shortest human exon is 3 bp!
  • First and last exons are particularly difficult
  • No bounding intron on one side, so no splice site
    signal there (although start and stop codons are
    related signals)
  • Generally contain non-coding sequence as well as
    coding sequence, so hexamer signals are diluted.
  • Alternative splicing (isoforms) means that there
    are multiple true solutions for some genes.
  • For the human genome, long introns of a few
    thousand bp are very common, making the problem
    extremely difficult.

7
I.1. Introduction: The importance of splice site
prediction
Perfect prediction (ignoring isoforms for now)
Missing out a gene
False negative
Computationally expensive
False positive
8
I.2.1 Signature: Signatures of exons and non-exons
  • Signatures of splice sites (exons)
  • CpG islands
  • intron splice junctions
  • (hexa)nucleotide frequencies
  • Signatures of non-exons (repeats, ALUs, etc.)
  • Compatibility with surrounding sequences and other
    exons (e.g. consistent reading frame)

9
I.2.1 Signature: CpG Islands
  • CpG islands are regions of sequence that have a
    high proportion of CG dinucleotides (the p is the
    phosphodiester bond linking them)
  • CpG islands are present in the promoter and
    exonic regions of approximately 40% of mammalian
    genes
  • Other regions of the mammalian genome contain few
    CpG dinucleotides and these are largely
    methylated
  • Methylation disables transcription factor binding
  • Definition: sequences of > 500 bp with
  • GC ≥ 55%
  • Observed(CpG)/Expected(CpG) ≥ 0.65
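
The two thresholds in the definition above can be checked directly on a sequence. The following is a minimal sketch, not a published island finder: the function names are hypothetical, and the expected CpG count is assumed to be (#C × #G) / length, the usual independence estimate.

```python
def cpg_stats(seq):
    """Return (GC fraction, observed/expected CpG ratio) for a DNA string."""
    seq = seq.upper()
    n = len(seq)
    c, g = seq.count("C"), seq.count("G")
    observed = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact
    gc_fraction = (c + g) / n if n else 0.0
    # Assumption: expected CpG count if C and G occurred independently.
    expected = c * g / n if n else 0.0
    obs_exp = observed / expected if expected else 0.0
    return gc_fraction, obs_exp


def looks_like_cpg_island(seq, min_len=500, min_gc=0.55, min_obs_exp=0.65):
    """Apply the slide's thresholds (> 500 bp, GC >= 55%, obs/exp >= 0.65)."""
    gc_fraction, obs_exp = cpg_stats(seq)
    return len(seq) > min_len and gc_fraction >= min_gc and obs_exp >= min_obs_exp
```

In practice the test would be applied to a sliding window over a chromosome rather than to one whole sequence.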

10
I.2.1 Signature: Splice junctions
  • Most eukaryotic introns have a consensus splice
    signal: GU (GT in DNA) at the beginning (donor),
    AG at the end (acceptor).
  • Variation does occur in the splice sites
  • Many AGs and GTs are not splice sites.
  • Recently, exon enhancers and insulators
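
To see why "many AGs and GTs are not splice sites", it helps to list every candidate. A minimal sketch, assuming a plain DNA string on the forward strand only; `candidate_splice_sites` is a hypothetical helper name.

```python
def candidate_splice_sites(dna):
    """Return (donors, acceptors): 0-based positions of every GT and AG."""
    dna = dna.upper()
    donors = [i for i in range(len(dna) - 1) if dna[i:i + 2] == "GT"]
    acceptors = [i for i in range(len(dna) - 1) if dna[i:i + 2] == "AG"]
    return donors, acceptors


donors, acceptors = candidate_splice_sites("AAGGTAAGT")
print(donors, acceptors)
```

Every position returned is only a candidate; a classifier still has to decide which ones are real.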

11
I.2.1 Signature: Hexanucleotide frequencies
  • Amino acid distributions are biased, e.g. p(A) >
    p(C)
  • Pairwise distributions are also biased, e.g.
    p(AT)/p(A)p(T) > p(AC)/p(A)p(C)
  • Nucleotides that code for preferred amino acids
    (and AA pairs) occur more frequently in coding
    regions than in non-coding regions.
  • Codon biases (per amino acid)
  • Hexanucleotide distributions that reflect those
    biases indicate coding regions.
  • Statistics have to be gathered over a large amount
    of sequence.

12
I.2.1 Signature: Issues in 6-mer frequency
  • Sliding window across all 6 reading frames.
  • Significance of a score?
  • To get good statistics on hexamer
    frequencies in an ORF, the ORF has to be long
  • Amino acid dimer (and nucleotide hexamer)
    frequencies vary by organism.
  • Using frequencies from one organism (or a
    consensus) for another gives an even noisier signal.
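
A hedged sketch of hexamer scoring across all 6 reading frames. The log-odds form, the pseudocount, and all function names are assumptions made for illustration; the slides do not specify a scoring formula. Training sequences are assumed to contain only A, C, G, T.

```python
import math
from collections import Counter


def hexamer_counts(seqs, step=1):
    """Count hexamers, sampling every `step` bases (step=3 keeps one frame)."""
    counts = Counter()
    for s in seqs:
        s = s.upper()
        for i in range(0, len(s) - 5, step):
            counts[s[i:i + 6]] += 1
    return counts


def log_odds_table(coding_seqs, noncoding_seqs, pseudo=1.0):
    """Return score(hexamer) = log(P(h | coding) / P(h | non-coding))."""
    cod = hexamer_counts(coding_seqs, step=3)      # in-frame hexamers only
    non = hexamer_counts(noncoding_seqs, step=1)
    n_cod, n_non = sum(cod.values()), sum(non.values())
    kinds = 4 ** 6                                 # number of possible hexamers

    def score(h):
        p_c = (cod[h] + pseudo) / (n_cod + pseudo * kinds)
        p_n = (non[h] + pseudo) / (n_non + pseudo * kinds)
        return math.log(p_c / p_n)

    return score


def reverse_complement(s):
    return s.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def six_frame_scores(seq, score):
    """Sum hexamer log-odds in each of the 6 frames: (strand, offset) -> total."""
    frames = {}
    for strand, s in (("+", seq.upper()), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[(strand, offset)] = sum(
                score(s[i:i + 6]) for i in range(offset, len(s) - 5, 3)
            )
    return frames
```

With a realistic training set the window would slide along the genome, and the frame with the highest total in each window would be the coding candidate. The slide's question about significance still applies: a raw sum has no built-in threshold.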

13
I.2.2. Splice Site Prediction: Inductive Learning
  • The inference of general rules from a set of
    examples: a classic problem in CS, statistics...
  • Training examples
  • Must be representative (random is best) and
    labeled
  • Need an adequate number
  • Sometimes benefit from positive and negative
    instances
  • Representation (what aspects of examples?)
  • Kind of rule to be induced (e.g. linear)
  • Algorithms for induction
  • Decision Tree
  • Support Vector Machine
  • Artificial Neural Networks
  • Bayesian Learning (Bayesian Network)
  • Instance-Based Learning
  • The key is __________ not so much in __________

14
I.2.2. Splice Site Prediction: Representation
  • Most important aspect of inductive learning
  • Most common: fixed-length feature vector.
  • Feature: an observable value related to the task
  • Fixed-length vector: a particular list of
    features which has the same meaning from example
    to example
  • Some feature sets (like sequences) have variable
    length or can have variable meaning
  • Can translate by sliding window
  • Position Weight Matrix (PWM)
  • Non-positional consensus sequences
  • Need all the relevant features and not too many
    irrelevant ones (for the amount of data)
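
One way to turn aligned sites into a positional representation is a PWM scanned with a sliding window. This is a minimal sketch under stated assumptions: a pseudocount of 1, a uniform 0.25 background, log2 odds, and sequences containing only A, C, G, T. The function names and the example sites are illustrative, not taken from the slides.

```python
import math


def build_pwm(sites, alphabet="ACGT", pseudo=1.0, background=0.25):
    """Build a log-odds PWM (one dict per column) from equal-length sites."""
    total = len(sites) + pseudo * len(alphabet)
    pwm = []
    for pos in range(len(sites[0])):
        column = [s[pos] for s in sites]
        pwm.append({
            b: math.log2(((column.count(b) + pseudo) / total) / background)
            for b in alphabet
        })
    return pwm


def score_window(pwm, window):
    """Sum the per-position log-odds for one window of the PWM's width."""
    return sum(col[base] for col, base in zip(pwm, window))


def scan(pwm, seq):
    """Slide the PWM along seq; return (position, score) for every window."""
    w = len(pwm)
    return [(i, score_window(pwm, seq[i:i + w])) for i in range(len(seq) - w + 1)]


# Made-up donor-like sites, for illustration only.
pwm = build_pwm(["AGGTAAGT", "AGGTGAGT", "CGGTAAGT", "AGGTAAGA"])
best_pos, best_score = max(scan(pwm, "TTTAGGTAAGTCCC"), key=lambda hit: hit[1])
```

The per-window scores can serve directly as features in the fixed-length vector that the learning algorithm sees.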

15
I.2.2. Splice Site Prediction: High Imbalanced
Ratio
  • The difficulty: high imbalanced ratio.
  • Assume that all nucleotides have an equal chance
    of appearing. Then, roughly speaking, the expected
    number of GTs is about 3,000,000,000/16.
  • Given that human has about 25,000 genes and each
    gene has 10 exons on average, the number of
    actual donor sites is about 250,000.
  • Thus, the imbalanced ratio is roughly
  • 16 × 250,000 : 3,000,000,000 = 4 : 3,000 = 1 : 750!
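
The arithmetic above can be reproduced in a few lines. The figures are the slide's own round numbers (single strand, uniform base composition); the variable names are illustrative.

```python
genome_size = 3_000_000_000      # bases in the human genome (slide's figure)
expected_gt = genome_size / 16   # each dinucleotide expected at 1 position in 16
genes = 25_000
exons_per_gene = 10
donor_sites = genes * exons_per_gene   # about 250,000 true donor sites

ratio = expected_gt / donor_sites      # candidate GTs per true donor site
print(f"true : candidate = 1 : {ratio:.0f}")   # prints "true : candidate = 1 : 750"
```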

16
I.2.2. Splice Site Prediction: Approaches to
dealing with high imbalanced ratio
  • In bioinformatics:
  • Under-sample the majority class
  • Splicing: predict splicing with known gene
    boundaries. Imbalanced ratio reduced from 1:750 to
    1:30, assuming that coding sequence occupies 2% of
    our genome.

17
I.2.2. Splice Site Prediction: Better ways of
dealing with imbalanced data
  • The basic idea:
  • First cluster the majority class; hopefully
    there is a large collection of clusters that are
    almost purely majority.
  • If an unknown instance falls into one of these
    clusters, then we say that the instance is a
    negative instance.

18
I.2.2 Splice Site Prediction: Performance
Measure for Imbalanced Data
  • If the imbalanced ratio is high, say 1:1000, then
    simply predicting negative every time (assuming
    that the majority is negative) gives accuracy 99.9%!
  • More meaningful measures are recall and precision
  • Recall: proportion of positive (minority)
    instances that are predicted (i.e. recalled) as
    positive.
  • Precision: proportion of instances predicted
    positive that are actually positive.
  • But recall and precision are complementary, often
    competing, goals, so use the F-measure, the harmonic
    mean of recall and precision, as a compromise measure.
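
The point about accuracy can be checked numerically. A minimal sketch with hypothetical function names; the counts are made up to match the 1:1000 example (1,000 positives, 1,000,000 negatives) and a classifier that always says "negative".

```python
def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)


def precision_recall_f(tp, fp, fn):
    """Return (precision, recall, F-measure); 0.0 where a ratio is undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f_measure


# Always-negative classifier on 1,000 positives and 1,000,000 negatives.
acc = accuracy(tp=0, fp=0, fn=1_000, tn=1_000_000)
p, r, f = precision_recall_f(tp=0, fp=0, fn=1_000)
print(f"accuracy={acc:.3f} precision={p} recall={r} F={f}")
```

Accuracy comes out at about 0.999 while recall and the F-measure are 0, which is why accuracy alone is misleading here.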

19
I.2.2. Splice Site Prediction: Kihoon's Splice
Site Result
20
I.2.3 Gene Assembly
  • There are three approaches, plus combinations of
    them:
  • Homology-based
  • Exon prediction-based
  • HMM approach
  • Combining other predictors

21
I.2.3. Gene Assembly: Inference by homology
  • For exon finding, we need to find matches to:
  • (1) ESTs (start from TSS)
  • (2) SwissProt or other protein databases (start
    from TIS)
  • (3) Known exons
  • Intronless: easy. Match all substrings whose length
    is a multiple of 3, running from AUG to a stop
    codon (UAA, UAG, UGA), against ESTs.
  • For (2), we need to convert the predicted coding
    sequence to protein sequence.
  • For (3), certain organisms are particularly
    useful. E.g., Tetraodon nigroviridis
    (a pufferfish) is a vertebrate with very few introns.
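
For the intronless case, candidate coding substrings can be enumerated with a simple scan. A minimal sketch, assuming DNA letters (ATG and TAA/TAG/TGA in place of AUG and UAA/UAG/UGA) and the forward strand only; `find_orfs` is a hypothetical name.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=1):
    """Return (start, end) of every ATG...stop stretch, end exclusive.

    Each ATG is extended codon by codon until the first in-frame stop,
    so every reported length is a multiple of 3.
    """
    dna = dna.upper()
    orfs = []
    for start in range(len(dna) - 2):
        if dna[start:start + 3] != "ATG":
            continue
        for i in range(start + 3, len(dna) - 2, 3):
            if dna[i:i + 3] in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                break
    return orfs
```

The resulting substrings would then be matched against ESTs, or translated and matched against protein databases, as in (1) and (2) above.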

[Figure labels: May need to backtrack; TSS]
22
I.2.3. Gene Assembly: ExoFish
  • Idea: coding regions are more conserved than
    non-coding regions
  • Uses sequences of Tetraodon nigroviridis:
  • Genome is 8 times smaller than human
  • Almost no repetitive sequence
  • Relatively few introns.
  • Most human genes have a homolog
  • 70% complete sequence

23
I.2.3. Gene Assembly: EST_Genome
  • ESTs (Expressed Sequence Tags) come from random
    sequencing of fragments of mRNAs. They are not
    complete coding sequences, but all come from
    expressed sequence
  • dbEST and other large databases exist (from
    varying organisms). There are over 6 million
    ESTs.
  • Matching ESTs to genomic sequence gives important
    clues about coding sequences
  • Needs to handle introns gracefully when aligning
    ESTs to the genome
  • AAT, GeneSeqer, SIM4, Spidey: match ESTs
  • Procrustes, GeneWise, ORFgene, ALN: match
    proteins
  • The matches serve as evidence.

24
I.2.3. Gene Assembly: Prediction-Based Method
  • Instead of trying to match ESTs, build an exon
    predictor.
  • Splice site predictors generate candidate exons.
  • Need to determine the real exons among the
    candidate exons.
  • Train a classifier: convert exons in known genes
    and non-exons (segments that have AG in front and
    GT behind but are not exons) to training
    examples. Each exon is assigned a confidence
    score.
  • Use dynamic programming to assemble the exons
    into genes.
  • Examples: GeneId, GeneView, GAP3, FGENE, DAGGER.
  • Advantage
  • Disadvantage
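
The dynamic programming step can be sketched as picking the highest-scoring chain of non-overlapping candidate exons. This is a simplification and not the algorithm of any tool named above: it ignores reading-frame compatibility and intron length, and `assemble_exons` is a hypothetical name. Candidates are (start, end, score) with end exclusive.

```python
import bisect


def assemble_exons(candidates):
    """Return (best_total_score, chain) over non-overlapping candidate exons."""
    exons = sorted(candidates, key=lambda e: e[1])   # sort by end coordinate
    ends = [e[1] for e in exons]
    best = [0.0] * (len(exons) + 1)     # best[k]: optimum using the first k exons
    choice = [None] * (len(exons) + 1)  # predecessor index if exon k is taken
    for k, (start, end, score) in enumerate(exons, 1):
        # j = number of earlier exons that end at or before this exon's start
        j = bisect.bisect_right(ends, start, 0, k - 1)
        if best[j] + score > best[k - 1]:
            best[k], choice[k] = best[j] + score, j
        else:
            best[k] = best[k - 1]
    # Trace back through the choices to recover the chain.
    chain, k = [], len(exons)
    while k > 0:
        if choice[k] is None:
            k -= 1
        else:
            chain.append(exons[k - 1])
            k = choice[k]
    return best[-1], chain[::-1]
```

A real gene assembler would add constraints to the same recurrence, for example that consecutive exons keep a consistent reading frame.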

25
I.2.3. Gene Assembly: Hidden Markov Model
  • Instead of predicting splice sites and then trying
    to match ESTs or using the predictor for
    exon prediction, generate splice sites and
    predict exons together using an HMM.
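
As an illustration of labelling a sequence with an HMM, here is a toy two-state Viterbi decoder. The states ("E" exon-like and GC-rich, "I" intron-like and AT-rich) and all probabilities are invented for the example; real gene-finding HMMs have many more states. All probabilities must be non-zero because the code works in log space.

```python
import math


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable state path for the observation string."""
    V = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            # Best previous state to arrive from when entering s at time t.
            prev = max(states, key=lambda p: V[t - 1][p] + math.log(trans_p[p][s]))
            V[t][s] = (V[t - 1][prev] + math.log(trans_p[prev][s])
                       + math.log(emit_p[s][obs[t]]))
            back[t][s] = prev
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]


# Toy parameters, assumed for illustration only.
STATES = ("E", "I")
START = {"E": 0.5, "I": 0.5}
TRANS = {"E": {"E": 0.9, "I": 0.1}, "I": {"E": 0.1, "I": 0.9}}
EMIT = {"E": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
        "I": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}}

print("".join(viterbi("AAAAGGGG", STATES, START, TRANS, EMIT)))  # prints IIIIEEEE
```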

26
I.2.3. Gene Assembly: Putting together multiple
predictions
  • None of these methods is ideal, so...
  • Best results come from agreement among multiple
    approaches
  • Computational integration. Many gene finders now
    include EST or homology information
  • E.g. FGenesH, GenomeScan
  • Visualization environments for viewing multiple
    predictions
  • Genome browsers (UCSC annotation tracks!),
    Genotator

27
I.2.3. Gene Assembly: GeneComber
  • Simple, open source Perl/MySQL program that finds
    agreement between HMMGene and GenScan
  • Accuracy is higher when they agree
  • Common strategy for combining predictions
  • Can also use fancier combining principles, e.g.
    NNs
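
The agreement idea can be sketched with set operations. This is not GeneComber's actual code or data model: exons are assumed to be (start, end) tuples, agreement is assumed to mean an exact coordinate match, and `agreement` is a hypothetical name.

```python
def agreement(pred_a, pred_b):
    """Split two exon prediction lists into shared and predictor-specific sets."""
    a, b = set(pred_a), set(pred_b)
    return {
        "both": sorted(a & b),      # predicted identically by both programs
        "only_a": sorted(a - b),
        "only_b": sorted(b - a),
    }


result = agreement([(100, 250), (400, 520)], [(100, 250), (410, 520)])
```

Exons in "both" would be reported with higher confidence. A looser rule, such as overlap above some fraction, is another possible design.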

28
I.3.1. Alternative Splicing
  • An estimated 50% of human genes have alternative
    splicing.
  • Drosophila Dscam (Down syndrome cell adhesion
    molecule): > 38,000 isoforms! Present in human
    on 21q22.
  • Recent results have demonstrated the importance
    of alternative splicings (e.g. exon skipping, use
    of partial exons), particularly in mammalian
    genomes
  • Predictions of alternatively spliced gene
    products are valuable
  • Knowledge of alternative splice possibilities can
    be used to improve gene prediction

29
I.3.1. Alternative Splicing: EuGÈNE
  • Takes alignments of alternatively spliced
    transcripts, and enumerates all possible paths
    through exons that could generate observations
  • Uses dynamic programming to find a minimal set

30
I.3.1. Alternative Splicing: HMM path
distributions
  • Cawley & Pachter, '05 used the sampled distribution
    of paths through HMMGene to estimate
    alternative splicing likelihoods
  • Contrast with the Viterbi assumption!

31
I.3.1. Alternative Splicing: Predictions of
unobserved AS
  • Chris Burge's UNCOVER: pair HMM (states cover two
    sequences, saying either conserved or not)
  • Looks at one human/mouse orthologous intron to
    estimate whether it contains a splice
  • Alternative conserved exons detected with high
    accuracy (much better than BLASTing)
  • 100s of novel predictions verified with PCR

32
I.3.2 RNA gene prediction
  • Not all gene products are proteins! RNA genes
    are significant, too.
  • Hexamer frequency, etc. obviously not going to
    work
  • Generally, different RNA genes have quite
    different characteristics, so there are different
    approaches to each family, e.g.
  • Bacterial RNA genes
  • tRNAs
  • miRNAs

33
I.3.2. RNA gene prediction: miRNA predictions
  • Recent discoveries show growing importance of
    micro RNAs in expression regulation
  • Chris Burge (again!) with MiRScan uses SVMs
  • Particularly tricky, since there is no known
    negative set.
  • Uses a set that is probably mostly negative, and
    resampling to estimate a threshold that allows
    some negative examples to become positive.
    Seeded with known positives for testing

34
I.3.3. TIS Predictor: Previous result
  • Almost always, starts with ATG.
  • Some papers say "It begins with ATG, so it is
    trivial."
  • Number of ATGs vs. number of genes: imbalanced
    ratio of 1:3500.
  • Most results are based on the Pedersen and Nielsen
    data set, with an imbalanced ratio of 1:3.

35
I.3.3. Promoter Predictor: Previous Result
  • Promoter / TSS prediction
  • TSS prediction is essentially promoter
    prediction. RNA polymerase needs to bind to
    the promoter, which is close to the TSS.
  • TSS is more difficult since there is no consensus
    sequence like ATG. Imbalanced ratio of
    25,000 : 3,000,000,000 = 1 : 120,000
  • Other consensus sequences have been found
    recently: "Clustering of DNA sequences in human
    promoters".
  • Typically place a window
  • Recently, window is reduced to -50 to +50,
    with HMM. Result: recall and precision about
    20-30%.

[Figure labels: 4000; TSS]
36
I.3.3. Promoter Prediction
  • Some popular classes of consensus sequences

TBP is a subunit of TFIID, which in turn is part of
the RNA polymerase II preinitiation complex.
37
I.3.3. TIS/TSS/Promoter Predictors: Some ideas.
  • Kozak sequence: (GCC)RCCATGG, where R is a purine
    (A or G). Possibly other sequences.
  • Kozak's result is fairly old (from the 80s); with
    more genes known and better computational
    techniques, there may be other sequences.
  • Proposed project (find TIS, TSS, promoter):
  • Use the repeat database (on /mnt/nas1) to focus
    on the areas that may contain genes. Perhaps
    start with Chromosome 22 first.
  • Try to find non-positional consensus sequences as
    features. (Kihoon is developing a generic program
    for it)
  • Possibility of linking the two predictors: if
    there is no predicted TIS nearby then a promoter
    cannot exist. Need to compute statistics on how
    far the TIS is from the TSS (i.e. length of the 5' UTR).
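
As a starting point for the TIS part of the project, the Kozak consensus can be searched with a regular expression. A minimal sketch: the core RCCATGG is required, the optional GCC prefix is only reported as a flag, and `kozak_sites` is a hypothetical name. Only the forward strand is scanned.

```python
import re

# Zero-width lookahead so that overlapping core matches are all reported.
KOZAK_CORE = re.compile(r"(?=[AG]CCATGG)")


def kozak_sites(dna):
    """Return (atg_position, has_gcc_prefix) for every RCCATGG match."""
    dna = dna.upper()
    hits = []
    for m in KOZAK_CORE.finditer(dna):
        i = m.start()                    # position of the purine R
        has_prefix = dna[max(0, i - 3):i] == "GCC"
        hits.append((i + 3, has_prefix))  # the ATG starts 3 bases after R
    return hits


print(kozak_sites("TTGCCACCATGGCGTTTGCCATGGAA"))
```

An exact match is a strict filter; turning the consensus into a weight matrix would tolerate the variation the slide hints at with "possibly other sequences".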

38
I.3.3. TIS/TSS/Promoter Predictors: Some other
generalizations
  • Similar ideas can be applied to predict the 3'
    UTR. Somehow polyA tail prediction receives less
    attention, but recently, from talking to UTHSCSA
    people, it seems to be interesting.
  • For promoter prediction: perhaps predict the
    subclass, but need to do some background work to
    see if the subclass is interesting.
  • NOTE: you have to give me the background! Try to
    meet me and read at the very least 3-4 of the latest
    papers! October 18! Which is in 2 weeks' time! Need
    to meet me twice, no kidding!

39
I.3.4. Beyond Gene Finding: Gene Ontology
Predictions
  • Simply finding genes is not good enough. Need to
    know their
  • Function: what do they do?
  • Localization: where do they do it?
  • Process: how do they do it?
  • To this end, the Gene Ontology Consortium came up
    with a vocabulary to describe gene
    product attributes.
  • Problems with current gene prediction
  • Extremely high accuracy for a multiclass problem,
    > 80%. Problem with training data and homology
    search
  • Perhaps an intermediate prediction is to predict
    the signal peptide and what signals it provides.

40
I.3.4 Beyond Gene Finding: Regulatory Network
Networking
  • Regulatory network: how genes control other genes
    (mainly through the production of transcription
    factors or subunits of transcription factors).

41
I.3.4. Beyond Gene Finding: Other Networks
  • Metabolic pathway: how the cell takes in raw
    material (sugar)
  • Signal transduction: how the cell receives and
    transmits signals
  • Currently, we are only interested in partial
    regulatory networks that involve regulation via
    promoter control.
  • Beyond this, tissue to tissue communication. Way
    to go!