CSE182-L12 - PowerPoint PPT Presentation

Slides: 37
Provided by: vineet50
Learn more at: https://cseweb.ucsd.edu

Transcript and Presenter's Notes


1
CSE182-L12
  • Gene Finding

2
Silly Quiz
  • Who are these people, and what is the occasion?

3
Gene Features
4
DNA Signals
  • Coding versus non-coding
  • Splice Signals
  • Translation start

5
PWMs
Positions: -3 -2 -1 +1 +2 +3 +4 +5 +6
Example donor sites: AAGGTGAGT, CCGGTAAGT, GAGGTGAGG, TAGGTAAGG
  • Fixed length for the splice signal.
  • Each position is generated independently
    according to a distribution
  • Figure shows data from > 1200 donor sites
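A minimal sketch of how such a PWM could be built and used, assuming the four example donor sites from this slide as training data. The function names, the pseudocount, and the uniform 0.25 background are illustrative assumptions, not part of the lecture.

```python
import math

BASES = "ACGT"

def build_pwm(sites, pseudocount=1.0):
    """Return one {base: probability} dict per position of the aligned sites."""
    length = len(sites[0])
    assert all(len(s) == length for s in sites), "sites must share one fixed length"
    pwm = []
    for pos in range(length):
        # Pseudocounts (an assumption) keep unseen bases from getting probability 0.
        counts = {b: pseudocount for b in BASES}
        for s in sites:
            counts[s[pos]] += 1
        total = sum(counts.values())
        pwm.append({b: counts[b] / total for b in BASES})
    return pwm

def log_odds(pwm, seq, background=0.25):
    """Sum of per-position log2(p / background); positions are scored independently."""
    return sum(math.log2(col[base] / background) for col, base in zip(pwm, seq))

donor_sites = ["AAGGTGAGT", "CCGGTAAGT", "GAGGTGAGG", "TAGGTAAGG"]
pwm = build_pwm(donor_sites)
```

A candidate 9-mer that keeps the GT at positions +1 and +2 scores higher under `log_odds` than one that replaces it, which is the behaviour a donor-site PWM is meant to capture.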

6
MDD
  • PWMs do not capture correlations between
    positions
  • Many position pairs in the Donor signal are
    correlated

7
MDD method
  • Choose the position i which has the highest
    correlation score.
  • Split sequences into two sets: those which have the
    consensus at position i, and the remaining.
  • Recurse until <terminating conditions>
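The recursion above can be sketched as follows. This is only an illustration under stated assumptions: the "correlation score" of position i is taken to be the sum of chi-square statistics between "matches the consensus at i" and the base at every other position, and the stopping rules (`min_size`, `max_depth`, no remaining dependence) are hypothetical placeholders, since the slide leaves the terminating conditions unspecified.

```python
from collections import Counter

BASES = "ACGT"

def consensus(seqs, pos):
    """Most frequent base at one position."""
    return Counter(s[pos] for s in seqs).most_common(1)[0][0]

def chi_square(seqs, i, j, cons_i):
    """Chi-square statistic of the 2x4 table (consensus match at i) x (base at j)."""
    table = {(m, b): 0 for m in (True, False) for b in BASES}
    for s in seqs:
        table[(s[i] == cons_i, s[j])] += 1
    n = len(seqs)
    stat = 0.0
    for m in (True, False):
        row = sum(table[(m, b)] for b in BASES)
        for b in BASES:
            col = table[(True, b)] + table[(False, b)]
            expected = row * col / n
            if expected > 0:
                stat += (table[(m, b)] - expected) ** 2 / expected
    return stat

def mdd_split(seqs, positions, min_size=4, max_depth=3, depth=0):
    """Return a nested dict: internal nodes split on one position, leaves hold sequences."""
    # Hypothetical terminating conditions: small subset, depth limit, or one position left.
    if len(seqs) < min_size or depth >= max_depth or len(positions) < 2:
        return {"leaf": seqs}
    scores = {}
    for i in positions:
        c = consensus(seqs, i)
        scores[i] = sum(chi_square(seqs, i, j, c) for j in positions if j != i)
    best = max(scores, key=scores.get)
    c = consensus(seqs, best)
    match = [s for s in seqs if s[best] == c]
    rest = [s for s in seqs if s[best] != c]
    if scores[best] == 0 or not match or not rest:
        return {"leaf": seqs}  # no dependence left to exploit
    remaining = [p for p in positions if p != best]
    return {
        "position": best,
        "consensus": c,
        "match": mdd_split(match, remaining, min_size, max_depth, depth + 1),
        "rest": mdd_split(rest, remaining, min_size, max_depth, depth + 1),
    }

# Toy data: positions 0 and 1 always agree, position 2 is independent of both.
toy = ["AAC", "AAG", "AAT", "AAA", "CCC", "CCG", "CCT", "CCA"]
tree = mdd_split(toy, [0, 1, 2])
```

On the toy data the first split falls on one of the two correlated positions, and each resulting subset becomes a leaf because no dependence remains. A separate PWM could then be estimated on each leaf.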

8
MDD for Donor sites
9
Gene prediction: Summary
  • Various signals distinguish coding regions from
    non-coding
  • HMMs are a reasonable model for Gene structures,
    and provide a uniform method for combining
    various signals.
  • Further improvement may come from improved signal
    detection

10
How many genes do we have?
Nature
Science
11
Alternative splicing
12
Comparative methods
  • Gene prediction is harder with alternative
    splicing.
  • One approach might be to use comparative methods
    to detect genes
  • Given a similar mRNA/protein (from another
    species, perhaps?), can you find the best parse
    of a genomic sequence that matches that target
    sequence?
  • Yes, with a variant of alignment algorithms that
    penalizes introns separately from other gaps.
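A minimal sketch of such an alignment variant, assuming a simple scoring scheme that is not from the lecture: +1 match, -1 mismatch, -2 per ordinary gap base, and a flat `intron` cost for skipping any stretch of genomic sequence. Real tools also use splice signals to place intron boundaries; that is omitted here, and the function name is hypothetical.

```python
def spliced_align_score(mrna, genome, match=1, mismatch=-1, gap=-2, intron=-3):
    """Best score for aligning all of mrna to genome, with flat-cost intron jumps."""
    neg_inf = float("-inf")
    m, n = len(mrna), len(genome)
    # score[i][j]: best score for mrna[:i] with the alignment ending at genome[:j].
    score = [[neg_inf] * (n + 1) for _ in range(m + 1)]
    for j in range(n + 1):
        score[0][j] = 0  # genomic sequence before the gene is free
    for i in range(1, m + 1):
        score[i][0] = i * gap
        best_earlier = neg_inf  # max of score[i][j'] over all j' < j
        for j in range(1, n + 1):
            best_earlier = max(best_earlier, score[i][j - 1])
            sub = match if mrna[i - 1] == genome[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align two bases
                score[i - 1][j] + gap,      # mRNA base with no genomic partner
                score[i][j - 1] + gap,      # ordinary one-base genomic gap
                best_earlier + intron,      # intron: skip genome[j':j] at a flat cost
            )
    return max(score[m])  # genomic sequence after the gene is free

exon1, intron_seq, exon2 = "ACGT", "GTAAAAAAAAAG", "TTGA"
genome = exon1 + intron_seq + exon2
mrna = exon1 + exon2
best = spliced_align_score(mrna, genome)
```

With the flat intron cost, the 12-base intron is skipped in one step, whereas charging it as 12 ordinary gap bases would make the correct two-exon parse score worse than a poor ungapped alignment.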

13
Comparative gene finding tools
  • Genscan/Genie
  • Procrustes/Sim4: mRNA vs. genomic
  • Genewise: proteins versus genomic
  • CEM: genomic versus genomic
  • Twinscan: Combines comparative and de novo
    approach.

14
Databases
  • RefSeq and other databases maintain sequences of
    full-length transcripts.
  • We can query using sequence.

15
De novo Gene prediction: Summary
  • Various signals distinguish coding regions from
    non-coding
  • HMMs are a reasonable model for Gene structures,
    and provide a uniform method for combining
    various signals.
  • Further improvement may come from improved signal
    detection

19
Comparative gene finding tools
  • Procrustes/Sim4: mRNA vs. genomic
  • Genewise: proteins versus genomic
  • CEM: genomic versus genomic
  • Twinscan: Combines comparative and de novo
    approach.

20
Course
  • Sequence Comparison (BLAST & other tools)
  • Protein Motifs
  • Profiles/Regular Expression/HMMs
  • Protein Sequence Identification via Mass Spec.
  • Discovering protein coding genes
  • Gene finding HMMs
  • DNA signals (splice signals)

21
Genome Assembly
22
DNA Sequencing
  • DNA is double-stranded
  • The strands are separated, and a polymerase is
    used to copy the second strand.
  • Special bases terminate this process early.

23
  • A break at T is shown here.
  • Measuring the lengths using electrophoresis
    allows us to get the position of each T
  • The same can be done with every nucleotide. Color
    coding can help separate different nucleotides
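A toy sketch of the idea on this slide, under simplifying assumptions: every terminated copy is represented only by its length and its final base, and the read is reported in terms of the copied strand. The function names and the example template are hypothetical.

```python
def terminated_lengths(template, base):
    """Lengths of the copies that stop at each occurrence of `base` (1-based)."""
    return [i + 1 for i, b in enumerate(template) if b == base]

def read_sequence(lengths_by_base):
    """Rebuild the sequence by ordering all fragments by length, as a gel would."""
    fragments = [
        (length, base)
        for base, lengths in lengths_by_base.items()
        for length in lengths
    ]
    return "".join(base for _, base in sorted(fragments))

template = "GATTACA"
lanes = {b: terminated_lengths(template, b) for b in "ACGT"}
recovered = read_sequence(lanes)
```

Each lane (or color) reports where one nucleotide occurs. Merging the four lanes by fragment length recovers the whole read.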

24
  • Automated detectors read the terminating bases.
  • The signal decays after 1000 bases.

25
Sequencing Genomes: Clone by Clone
  • Clones are constructed to span the entire length
    of the genome.
  • These clones are ordered and oriented correctly
    (Mapping)
  • Each clone is sequenced individually

26
Shotgun Sequencing
  • Shotgun sequencing of clones was considered
    viable
  • However, researchers in 1999 proposed shotgunning
    the entire genome.

27
Library
  • Create vectors of the sequence and introduce them
    into bacteria. As bacteria multiply you will have
    many copies of the same clone.

28
Sequencing
29
Questions
  • Algorithmic: How do you put the genome back
    together from the pieces? Will be discussed in
    the next lecture.
  • Statistical: How many pieces do you need to
    sequence, etc.?
  • The answer to the statistical questions had
    already been given in the context of mapping, by
    Lander and Waterman.

30
Lander Waterman Statistics
[Figure labels: Island, L, G]
31
LW statistics questions
  • As the coverage c increases, more and more areas
    of the genome are likely to be covered. Ideally,
    you want to see 1 island.
  • Q1: What is the expected number of islands?
  • Ans: N exp(-cσ)
  • The number increases at first, and gradually
    decreases.
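A small numerical sketch of the island formula, assuming the usual Lander-Waterman definitions, which the slides do not spell out: N clones of length L on a genome of length G, coverage c = NL/G, minimum detectable overlap T, and σ = 1 - T/L. The function name and the example numbers are illustrative only.

```python
import math

def expected_islands(n_clones, clone_len, genome_len, min_overlap):
    """Expected number of islands, N * exp(-c * sigma)."""
    c = n_clones * clone_len / genome_len  # coverage
    sigma = 1 - min_overlap / clone_len    # fraction of a clone that must go uncovered
    return n_clones * math.exp(-c * sigma)

G, L, T = 100_000, 1_000, 100
for n in (50, 111, 500, 1000):
    print(n, round(expected_islands(n, L, G, T), 2))
```

With these numbers the expected count rises until cσ is about 1 (around N = 111 here) and then falls toward a single island as coverage grows, matching the last bullet.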

32
Analysis: Expected Number of Islands
  • Computing Expected # islands.
  • Let X_i = 1 if an island ends at position i, X_i = 0
    otherwise.
  • Number of islands = Σ_i X_i
  • Expected # islands = E(Σ_i X_i) = Σ_i E(X_i)

33
Prob. of an island ending at i
[Figure labels: i, L, T]
  • E(X_i) = Prob(island ends at pos. i)
  • = Prob(clone began at position i-L+1
  • AND no clone began in the next L-T positions)
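The probability above can be written out as a short derivation. It assumes the usual Lander-Waterman setup, which is not defined on the slides: N clones whose start positions are uniform over a genome of length G, coverage c = NL/G, and σ = 1 - T/L.

```latex
\begin{align*}
E(X_i) &= \Pr(\text{a clone begins at } i-L+1)\cdot
          \Pr(\text{no clone begins in the next } L-T \text{ positions}) \\
       &\approx \frac{N}{G}\left(1-\frac{N}{G}\right)^{L-T}
        \approx \frac{N}{G}\,e^{-N(L-T)/G}
        = \frac{N}{G}\,e^{-c\sigma}, \\
\sum_{i=1}^{G} E(X_i) &\approx N e^{-c\sigma},
\qquad c=\frac{NL}{G},\quad \sigma = 1-\frac{T}{L}.
\end{align*}
```

Summing over all G positions cancels the 1/G factor, which is why the expected number of islands depends on the genome length only through the coverage c.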

34
LW statistics
  • Pr[Island contains exactly j clones]?
  • Consider an island that has already begun. With
    probability e^(-cσ), it will never be continued.
    Therefore
  • Pr[Island contains exactly j clones] = (1 - e^(-cσ))^(j-1) e^(-cσ)
  • Expected # of j-clone islands = N e^(-2cσ) (1 - e^(-cσ))^(j-1)

35
Expected # of clones in an island
Why?
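One way to answer the question, assuming the number of clones in an island is geometric with stopping probability p = e^{-cσ} (each clone is the last one in its island with that probability), where c is the coverage and σ = 1 - T/L:

```latex
\begin{align*}
E(\text{clones per island})
  &= \sum_{j\ge 1} j\,(1-p)^{j-1}\,p \;=\; \frac{1}{p} \;=\; e^{c\sigma},
  \qquad p = e^{-c\sigma}.
\end{align*}
```

The same value follows from dividing the N clones by the expected number of islands, N e^{-cσ}.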
36
Expected length of an island