Statistical modeling and classification in Biological Sequence Space

Title: PowerPoint Presentation Author: Gene Yeo Last modified by: Gene Yeo Created Date: 4/5/2003 11:23:10 PM Document presentation format: On-screen Show – PowerPoint PPT presentation

Provided by: Gene60
Learn more at: http://www.mit.edu

Transcript and Presenter's Notes


1
Statistical modeling and classification in
Biological Sequence Space
  • April 26, 04 9.520
  • Gene Yeo
  • Poggio, Burge @ MIT

2
Framework/Issues
  • Build models around known biology
  • In the process, extend knowledge about known
    biology
  • Predict new examples
  • Validate predictions by
      • prediction accuracy
      • experimental validation
      • higher-level traits of predictions
      • conservation in other genomes

3
Biological sequences
  • DNA, RNA and proteins are macromolecules built up
    from smaller units.
  • DNA units are the nucleotide residues A, C, G
    and T.
  • RNA units are the nucleotide residues A, C,
    G and U.
  • Protein units are the amino acid residues A,
    C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T,
    V, W and Y.
  • To a considerable extent, the chemical
    properties of DNA, RNA and protein molecules are
    encoded in the linear sequence of these basic
    units: their primary structure.

4
  • Statistical models can be descriptive and/or
    predictive.
  • Given a known biological signal -> describe the
    signal with statistical modeling -> find unknown
    examples of the same signal
      • Gene-finding (protein-coding genes)
      • Noncoding RNA genes
      • Protein domains
  • Warning: although successful, models are not to
    be taken literally.
  • Most important: biological confirmation of
    predictions is almost always necessary.

5
Sequences are full of signals!
ACGTAGCTAGCATGCATGCATGACTACGATCGACTACGATCAACGATGCA
TGCATCGACTACGATCAGCTACGATCAGCATCGACTAGCATCGATCAGCA
TCGATCAGCATCGACTAGCTACGACTAGCGCTAC
How do we model/describe these motifs?
6
Different models
  • RNA gene (covariation, SCFG, NN, SVM)
  • Protein structure (a variety of methods)
  • Protein gene (HMM, NN)
  • Splice site motif (WMM, MM, SVM, NN)
(Figure axes: complexity; DNA, RNA, protein)
7
Modeling dependencies in biological sequence motifs

Model                                     Assumptions
Weight Matrix Model (WMM)                 Independence (easy)
Hidden Markov Model (HMM)                 Local dependence (medium)
Stochastic Context-Free Grammar (SCFG)    Non-local pairwise dependence (hard)
8
A case study in computational biology: modeling
signals in genes
With so many genomes being sequenced, it
remains important to be able to identify genes,
and the signals within and around genes,
computationally.
9
What is a (protein-coding) gene?
DNA:     CCTGAGCCAACTATTGATGAA
RNA:     CCUGAGCCAACUAUUGAUGAA
Protein: PEPTIDE
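
The three lines on this slide are the same information at the DNA, RNA and protein level. A minimal sketch of that mapping, assuming the standard genetic code and reading the coding strand from its first base (the function names `transcribe` and `translate` are illustrative, not from the slides):

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order; '*' marks a stop codon.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W"
         "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR"
         "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def transcribe(dna):
    """Coding-strand DNA to mRNA: replace T with U."""
    return dna.upper().replace("T", "U")


def translate(dna):
    """Translate coding-strand DNA codon by codon, stopping at a stop codon."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


dna = "CCTGAGCCAACTATTGATGAA"
print(transcribe(dna))  # CCUGAGCCAACUAUUGAUGAA
print(translate(dna))   # PEPTIDE
```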
10
What is a gene, ctd?
  • In general the transcribed sequence is longer
    than the translated portion: parts called introns
    (intervening sequences) are removed, leaving exons
    (expressed sequences), and yet other regions
    remain untranslated. The translated sequence
    comes in triples called codons, beginning with
    a unique start codon (ATG) and ending with one of
    three stop codons (TAA, TAG, TGA).
  • There are also characteristic intron-exon
    boundaries called splice donor and acceptor
    sites, and a variety of other motifs: promoters,
    transcription start sites, polyA sites, branching
    sites, and so on.
  • All of the foregoing have statistical
    characterizations.
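
The start/stop codon rule above can be sketched as a naive open-reading-frame scan. This is only an illustration under simplifying assumptions: it looks at the forward strand only and ignores introns entirely, so it is not a gene finder. The function name `find_orfs` and the `min_codons` parameter are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=1):
    """Return (start, end) index pairs of ATG...stop stretches, end exclusive.

    Scans the three forward reading frames; each reported ORF includes
    its stop codon. min_codons counts codons before the stop.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        i = frame
        while i + 3 <= len(seq):
            if seq[i:i + 3] != "ATG":
                i += 3
                continue
            # Walk codon by codon until a stop codon or the end of the sequence.
            j = i + 3
            while j + 3 <= len(seq) and seq[j:j + 3] not in STOP_CODONS:
                j += 3
            if j + 3 > len(seq):
                break  # no in-frame stop codon: not a complete ORF
            if (j - i) // 3 >= min_codons:
                orfs.append((i, j + 3))
            i = j + 3
    return orfs


print(find_orfs("CCATGAAATAGGG"))  # [(2, 11)], i.e. ATG AAA TAG
```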

11
(No Transcript)
12
Some facts about human genes
  • Comprise about 3% of the genome
  • Average gene length: 8,000 bp
  • Average of 5-6 exons/gene
  • Average exon length: 200 bp
  • Average intron length: 2,000 bp
  • 8% of genes have a single exon

13
The idea behind a HMM genefinder
  • States represent standard gene features:
    intergenic region, exon, intron, perhaps more
    (promoter, 5' UTR, 3' UTR, poly-A, ...).
  • Observations embody state-dependent statistics,
    such as base composition, dependence, and signal
    features.
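
To make the state/observation idea concrete, here is a toy two-state HMM decoded with the Viterbi algorithm. It is a sketch only: the two states (N = noncoding, C = coding), the transition probabilities and the base-composition emissions are all made-up values for illustration, and a real gene finder has many more states and richer statistics.

```python
import math

# Hypothetical parameters, chosen only for illustration.
STATES = ("N", "C")  # N = noncoding, C = coding
START = {"N": 0.5, "C": 0.5}
TRANS = {"N": {"N": 0.9, "C": 0.1},
         "C": {"N": 0.1, "C": 0.9}}
EMIT = {"N": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},   # AT-rich
        "C": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2}}   # GC-rich


def viterbi(seq):
    """Return the most probable state path for seq as a string of state names."""
    if not seq:
        return ""
    # scores[s] = best log-probability of any path ending in state s.
    scores = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    backpointers = []
    for base in seq[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda r: scores[r] + math.log(TRANS[r][s]))
            new_scores[s] = (scores[prev] + math.log(TRANS[prev][s])
                             + math.log(EMIT[s][base]))
            pointers[s] = prev
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))


print(viterbi("GCGCGCGC"))  # CCCCCCCC
print(viterbi("ATATATAT"))  # NNNNNNNN
```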

14
GENSCAN (Burge & Karlin)
15
a simple genefinder
16
Splice sites can be an important signal
17
Regular expressions can be limiting
C A
A G
AGGT AGT
5' splice junction in eukaryotes
T C
T C
N AGC
3' splice junction
Most protein binding sites are characterized by
some degree of sequence specificity, but
seeking a consensus sequence is often an
inadequate way to recognize sites.
Position-specific distributions came to represent
the variability in motif composition.
18
Position-specific scoring matrix (PSSM)

Pos    -3    -2    -1     1     2     3     4     5     6
A     0.3   0.6   0.1   0.0   0.0   0.4   0.7   0.1   0.1
C     0.4   0.1   0.0   0.0   0.0   0.1   0.1   0.1   0.2
G     0.2   0.2   0.8   1.0   0.0   0.4   0.1   0.8   0.2
T     0.1   0.1   0.1   0.0   1.0   0.1   0.1   0.0   0.5
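
One way to use the position-specific frequencies on this slide is as a log-odds score against a background model. The sketch below makes several assumptions that are not in the slides: a uniform background of 0.25 per base, a small pseudocount so that zero frequencies do not give log(0), and the function names `log_odds_score` and `best_site`. The frequencies themselves are the slide's values, read for positions -3 to 6.

```python
import math

POSITIONS = [-3, -2, -1, 1, 2, 3, 4, 5, 6]
FREQS = {
    "A": [0.3, 0.6, 0.1, 0.0, 0.0, 0.4, 0.7, 0.1, 0.1],
    "C": [0.4, 0.1, 0.0, 0.0, 0.0, 0.1, 0.1, 0.1, 0.2],
    "G": [0.2, 0.2, 0.8, 1.0, 0.0, 0.4, 0.1, 0.8, 0.2],
    "T": [0.1, 0.1, 0.1, 0.0, 1.0, 0.1, 0.1, 0.0, 0.5],
}
BACKGROUND = 0.25   # assumption: uniform base composition
PSEUDOCOUNT = 0.01  # assumption: avoids log(0) for unseen bases


def log_odds_score(site):
    """Sum of per-position log2(frequency / background) for a 9-base site."""
    if len(site) != len(POSITIONS):
        raise ValueError("site must have exactly 9 bases")
    score = 0.0
    for i, base in enumerate(site.upper()):
        p = (FREQS[base][i] + PSEUDOCOUNT) / (1.0 + 4 * PSEUDOCOUNT)
        score += math.log2(p / BACKGROUND)
    return score


def best_site(seq):
    """Slide a 9-base window along seq; return (offset, score) of the best window."""
    width = len(POSITIONS)
    if len(seq) < width:
        raise ValueError("sequence is shorter than the motif")
    offsets = range(len(seq) - width + 1)
    best = max(offsets, key=lambda k: log_odds_score(seq[k:k + width]))
    return best, log_odds_score(seq[best:best + width])


offset, score = best_site("TTTCAGGTAAGTCCC")
print(offset, round(score, 2))
```

Because each position contributes independently to the sum, this is the weight matrix (independence) model; unlike a regular expression it ranks near-matches instead of rejecting them outright.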
19
OK, so we've got the genes
  • molecular biology (transcription, splicing)
  • signals are modeled as states (HMM) or
    separately, e.g. as PSSMs
  • Here's another catch: there isn't just one
    version of each gene, but sometimes several.

20
E.g. alternative splicing: CD44
Human chromosome 11p
Zhu et al., Science (2003)
21
Alternative splicing
  • is a major determinant of protein diversity
    (Lander 2001, Zavolan 2003)
  • 30-50% of human diseases involve alt. splicing

22
Defining constitutive and alternative exons
  • Constitutive exon
  • Skipped exon
  • 3' alternative exon
  • 5' alternative exon
  • Intron retention
  • Mutually exclusive exons
23
Conserved alternative, skipped exon - FXR1
Fragile X Related Gene, FXR1
24
Another example of a gene containing a CSE: DMWD
Myotonic Dystrophy-containing WD Repeat, DMWD
25
Predicting new alternatively spliced exons
  1. The problem is ill-posed
  2. High-dimensional space
  3. Do not overfit the data
  4. Simple feature selection
  5. Unbalanced data set sizes
  6. Labels are more flexible

26
Eg. of experimentally validated
27
Biological sequence space challenges
  • Models that represent as much of the biology as
    possible.
  • Biologically motivated features are important
  • Validating attributes
      • Conservation of events is key in computational
        biology
      • Higher-level consistency with known biology
      • Experimental validation of predictions is
        essential

28
Framework/Issues
  • Build models around known biology
  • In the process, extend knowledge about known
    biology
  • Predict new examples
  • Validate predictions by
      • prediction accuracy
      • experimental validation
      • higher-level traits of predictions
      • conservation in other genomes

29
Modeling higher-order interactions: yeast Phe tRNA
(if time permits)
Secondary Structure
Tertiary Structure
30
The Hammerhead Ribozyme
Secondary structure
Tertiary structure
31
One example of how to model and predict RNA 2°
structure
Covariation (using comparative genomics)
Seq1  A C G A A A G U
Seq2  U A G U A A U A
Seq3  A G G U G A C U
Seq4  C G G C A A U G
Seq5  G U G G G A A C
32
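Mutual information statistic for a pair of columns
in a multiple alignment

M_{ij} = \sum_{x,y} f_{xy}^{(i,j)} \log_2 \frac{f_{xy}^{(i,j)}}{f_x^{(i)} f_y^{(j)}}

f_{xy}^{(i,j)}: fraction of seqs w/ nt. x in col. i, nt. y
in col. j
f_x^{(i)}: fraction of seqs w/ nt. x in col. i
sum over x, y in {A, C, G, U}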
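
A small sketch of this statistic, applied to the five-sequence alignment from the covariation slide. It assumes the usual definition (the sum over nucleotide pairs of the joint frequency times the log of the joint frequency over the product of the two column frequencies) and base-2 logarithms, so the result is in bits. Columns are 0-indexed in the code, and `mutual_information` is an illustrative name.

```python
import math
from collections import Counter

# The alignment from the covariation slide, one string per sequence.
ALIGNMENT = [
    "ACGAAAGU",  # Seq1
    "UAGUAAUA",  # Seq2
    "AGGUGACU",  # Seq3
    "CGGCAAUG",  # Seq4
    "GUGGGAAC",  # Seq5
]


def mutual_information(alignment, i, j):
    """Mutual information (bits) between alignment columns i and j."""
    n = len(alignment)
    col_i = Counter(seq[i] for seq in alignment)
    col_j = Counter(seq[j] for seq in alignment)
    pairs = Counter((seq[i], seq[j]) for seq in alignment)
    mi = 0.0
    for (x, y), count in pairs.items():
        f_xy = count / n
        mi += f_xy * math.log2(f_xy / ((col_i[x] / n) * (col_j[y] / n)))
    return mi


# First and last columns covary (A-U, U-A, A-U, C-G, G-C): high MI.
print(round(mutual_information(ALIGNMENT, 0, 7), 4))  # 1.9219
# Columns 3 and 6 are invariant (all G, all A): MI is 0.
print(mutual_information(ALIGNMENT, 2, 5))            # 0.0
```

Note that perfectly conserved columns score zero even if they are base-paired, which is why covariation analysis needs sequences that are diverged enough to show compensatory changes.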
33
Inferring 2° Structure from Covariation
34
Stochastic Context-Free Grammars (SCFGs)
A generalized model that is capable of
handling non-local dependencies between words in
a language (or bases in an RNA)
Ref: Durbin et al., Biological Sequence
Analysis, 1998
35
An SCFG Model of RNA 2° Structure
Production rules:
  P -> aWb   (pair)
  L -> aW    (left bulge/loop)
  R -> Wa    (right bulge/loop)
  B -> SS    (bifurcation)
  S -> W     (start)
  E -> ε     (end)
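
To illustrate how such a grammar produces non-local dependencies, the sketch below samples a sequence and its dot-bracket structure from the pair, left, right, bifurcation, start and end rules. Everything numeric is an assumption for illustration: the rule probabilities are made up, W is taken to expand to any of P, L, R, B or E, paired bases are restricted to Watson-Crick pairs, and a depth cap forces termination. The names `expand_w` and `sample_rna` are hypothetical.

```python
import random

BASES = "ACGU"
WATSON_CRICK = [("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")]
# Hypothetical probabilities for expanding W into P, L, R, B or E.
RULE_PROBS = {"P": 0.4, "L": 0.1, "R": 0.1, "B": 0.05, "E": 0.35}


def expand_w(rng, depth=0, max_depth=50):
    """Expand nonterminal W; return (sequence, dot-bracket structure)."""
    if depth >= max_depth:
        rule = "E"  # force termination on very deep derivations
    else:
        rule = rng.choices(list(RULE_PROBS), weights=list(RULE_PROBS.values()))[0]
    if rule == "P":  # P -> a W b: the two ends are emitted together
        a, b = rng.choice(WATSON_CRICK)
        seq, struct = expand_w(rng, depth + 1, max_depth)
        return a + seq + b, "(" + struct + ")"
    if rule == "L":  # L -> a W: unpaired base on the left
        seq, struct = expand_w(rng, depth + 1, max_depth)
        return rng.choice(BASES) + seq, "." + struct
    if rule == "R":  # R -> W a: unpaired base on the right
        seq, struct = expand_w(rng, depth + 1, max_depth)
        return seq + rng.choice(BASES), struct + "."
    if rule == "B":  # B -> S S, and each S -> W
        seq1, struct1 = expand_w(rng, depth + 1, max_depth)
        seq2, struct2 = expand_w(rng, depth + 1, max_depth)
        return seq1 + seq2, struct1 + struct2
    return "", ""    # E -> empty string


def sample_rna(seed=0):
    """Sample one (sequence, structure) pair, starting from S -> W."""
    return expand_w(random.Random(seed))


seq, struct = sample_rna(seed=1)
print(seq)
print(struct)
```

The pair rule is what a weight matrix or HMM cannot express: the bases at the two ends of a stem are chosen jointly, however far apart they end up in the final sequence.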
36
Last page
  • Some of the slides were obtained from various
    places:
  • slides available online (primarily
    from lectures by Terry Speed)
  • slides from Chris Burge, Dirk Holste