Title: Searching for genes and biologically related signals in DNA sequences
1 Searching for genes and biologically related
signals in DNA sequences
- Mihaela Pertea
- Assistant Research Scientist
- CBCB
2Outline
- Introduction to the signal prediction problem
- Signal sensors in the context of gene finding
- Splice site prediction with GeneSplicer
- Exon splicing enhancers
3Signal Definition
Signals short sequence patterns in the genomic
DNA that are recognized by the cellular machinery.
4Signals Delimit Gene Features
Coding segments (CDSs) of genes are delimited by
four types of signals start codons (ATG in
eukaryotes), stop codons (usually TAG, TGA, or
TAA), donor sites (usually GT), and acceptor
sites (AG).
5The Stochastic Nature of Signal Motifs
(start codons)
(stop codons)
T G A
T A A
A T G
T A G
(donor splice sites)
(acceptor splice sites)
A G
G T
6Signal Sensors
Signal sensor a function to identify specific
locations in the DNA where a signal is likely to
reside.
GGCTAGTCATGCCAAACGCGG AAACCTAGTATGCCCACGTTGT
ACCCAGTCCCATGACCACACACAACC
ACCCTGTGATGGGGTTTTAGAAGGACTC
GGCTAGTCATGCCAAACGCGG AAACCTAGTATGCCCACGTTGT
ACCCAGTCCCATGACCACACACAACC
ACCCTGTGATGGGGTTTTAGAAGGACTC
- Given a signal X of fixed length ?, estimate the
distributions - p(X) the probability that X is a true signal
- p-(X) the probability that X is a false signal
7Identifying Signals In DNA with a Signal Sensor
Identifying Signals In DNA with a Signal Sensor
We slide a fixed-length model or window along
the DNA and evaluate score(signal) at each point
Compute the score of the signal
When the score is greater than some threshold
(determined empirically to result in a desired
sensitivity), we remember this position as being
the potential site of a signal.
8Computational methods for signal prediction
- There are many methods to estimate the
probability of the sequence X - consensus analysis
- Markov models
- weight matrices
- decision trees
- neural networks
-
9Markov Chains
A Markov chain is a sequence of random variables
X1, X2, X3, ... with the Markov property.
Having the Markov property means that, given the
present state, future states are independent of
the past states
Recall that products of large numbers of
probabilities tend to cause underflow in the
computer working in log-space is thus
recommended (regardless of the order of the
model).
10Training Markov Chains
Training the Markov chain we need only count the
(n1)-mer occurrences and convert these to
conditional probabilities via simple
normalization where C(g0...gn) is the
number of times the (n1)-mer Gg0...gn was seen
in the set of training features for this sensor.
For large n, the sample sizes for these
estimations can be small, leading to sampling
error gt Interpolated Markov Models (IMMs).
11Interpolated Markov Chains
For higher-order chains we can interpolate
between orders to mitigate the effects of
sampling error
where ?k is a constant depending on the sample
size of the k-mer.
12Weight Matrices
The most common signal sensor is the Weight
Matrix Model (WMM)
A
T
G
A
T
C
C
consensus region
A weight matrix is simply a fixed window in which
each cell of the window has a nucleotide
distribution. The consensus region of the WMM is
the span of cells in which a valid consensus for
the signal of interest is expected to occur.
Evaluating a WMM at a given position in a DNA
sequence is as simple as looking up each
nucleotide falling within the window in the
position-specific base distribution associated
with the corresponding cell of the sensors
window, and multiplying these together
C
P(CTATGACstart codon)
C
T
A
T
G
A
0.21
x 0.32
x 0.26
x 1
x 1
x 1
x 0.19
0.00331968
13Weight Array Matrices
Just as we generalized HMMs and Markov chains to
higher orders, we can utilize higher-order
distributions in a WMM.
Given a model M of length LM, we can condition
each cell in the sensors window on some number n
of previous bases observed in the sequence
This is called a weight array matrix (WAM).
A variant, called the windowed WAM, or WWAM
(Burge, 1998), pools counts over several
positions when training the model, in order to
reduce the incidence of sampling error
14Data size for Markov models parameter estimation
Parameter estimation error (coefficient of
variation) as a function of symbol probability,
pi , and sample size, na
a A measure of the error in the estimation of a
Markov transition probability, PxiW , by the
corresponding conditional frequency, fW,xi /fW ,
is given as a function of n, the count of the
pattern W which is being conditioned on, and pi ,
the true (conditional) probability of the
nucleotide xi. The measure used is the
coefficient of variation. (From Burge 1997,
Ph.D. thesis)
Example. The transition probabilities in a first
order WAM are conditional on a nucleotide at each
position. Given a set of 1200 sequences, the
sample size n is 1200/4 300 for the estimation
of the 4 transitions conditional on a given
nucleotide. This results in estimation errors in
the range of 10 to 20.
15Maximal Dependence Decomposition (MDD)
A special type of signal sensor based on decision
trees was introduced by the program GENSCAN
(Burge, 1998).
Rather than using one weight array matrix for all
splice sites, MDD differentiates between splice
sites in the training set based on the bases
around the AG/GT consensus.
Starting at the root of the tree, we apply
predicates over base identities at positions in
the sensor window, which determine the path
followed as we descend the tree.
At the leaves of the tree are weight matrices
specific to signal variants as characterized by
the predicates in the tree. Therefore, each leaf
has a different WAM trained from a different
subset of splice sites.
16MDD method
Constructing the MDD tree is done by performing a
number of ?2 tests of independence between cells
in the sensor window. At each bifurcation in the
tree we select the predicate of maximal
dependence.
- Dependencies among non-adjacent position are
captured by the chi-square statistic between the
consensus Ki at position i, and the nucleotide Nj
at position j, i ¹ j.
- If strong dependencies are detected (c2 ³ 16.3,
for the cutoff P0.001, 3df), then partition the
data into two subsets one containing the
sequences with the consensus Ki , and another one
with the remaining sequences.
17MDD splitting criterion
MDD uses the ?2 measure between the variable Ki
representing the consensus at position i in the
sequence and the variable Nj which indicates the
nucleotide at position j where Ox,y is the
observed count of the event that Ki x and Nj y,
and Ex,y is the value of this count expected
under the null hypothesis that Ki and Nj are
independent
Example
position5 consensus-2
position
consensus
?2 2.9
18MDD decision trees in A.thaliana
Each leaf has the signal sensor trained from a
different subset of the splice sites. The tree
is induced empirically for each genome.
19Signal Sensors in the Gene Finding Context
Signals (ATGstart codon TAGany of the three
stop codons GTdonor splice site AGacceptor
splice site) can be used to represent the syntax
of eukaryotic genes . Gene syntax rules (for
forward-strand genes) can then be stated very
compactly
20Efficient Decoding via Signal Sensors
sensor n
. . .
ATGs
. . .
insert into type-specific signal queues
signal queues
sensor 2
GTS
sensor 1
AGs
sequence
GCTATCGATTCTCTAATCGTCTATCGATCGTGGTATCGTACGTTCATTAC
TGACT...
detect putative signals during left-to-right pass
over squence
After identifying the most promising (i.e.,
highest-scoring) signals in an input sequence, we
can apply the gene syntax rules to connect these
into a graph
...ATG.........ATG......ATG..................GT
newly detected signal
elements of the ATG queue
21Representing Gene Syntax with ORF Graphs
.0045 .0032 .0031 .0021 .002 .0072
.0023 .0034 .0082
An ORF graph represents all possible gene parses
for a given set of putative signals.
A path through the graph represents a single gene
parse.
The number of potential gene models grows
exponentially with the number of predicted
signals.
22The Notion of Eclipsing
ATGGATGCTACTTGACGTACTTAACTTACCGATCTCT
ATGGATGCTACTTGACGTACTTAACTTACCGATCTCT
0 1 2 0 1 2 0 1 20 1 2 0 1 20 1 2 0 1 20 1 20 1
20 1 20 1 2 0 1 2 0
in-frame stop codon!
23Eclipsing of Signals
24GeneSplicer
25Splice site prediction with GeneSplicer
16bp
24bp
- The splice site score is a combination of
- first or second order inhomogeneous Markov
models on windows around the acceptor and donor
sites - MDD decision trees
- longer Markov models to capture difference
between coding and non-coding on opposite sides
of site (optional) - maximal splice site score within 60 bp (optional)
26Splice Site Scoring
Donor/Acceptor sites at location k DS(k)
Scomb(k,16) (Scod(k-80)-Snc(k-80))
(Snc(k1)-Scod(k1)) AS(k) Scomb(k,24)
(Snc(k-80)-Scod(k-80)) (Scod(k1)-Snc(k1)) Sc
omb(k,i) score computed by the Markov model/MDD
method using window of i bases Scod/nc(j) score
of coding/noncoding Markov model for 80bp window
starting at j
27Coding-noncoding Boundaries
A key observation regarding splice sites and
start and stop codons is that all of these
signals delimit the boundaries between coding and
noncoding regions within genes (although the
situation becomes more complex in the case of
alternative splicing). One might therefore
consider weighting a signal score by some
function of the scores produced by the coding and
noncoding content sensors applied to the regions
immediately 5? and 3? of the putative signal
28Local Optimality Criterion
When identifying putative signals in DNA, we may
choose to completely ignore low-scoring
candidates in the vicinity of higher-scoring
candidates. The purpose of the local optimality
criterion is to apply such a weighting in cases
where two putative signals are very close
together, with the chosen weight being 0 for the
lower-scoring signal and 1 for the higher-scoring
one.
29Evaluation Metrics for Signal Prediction
all consensus sites
all true sites in genome
false positives (FP)
true positives (TP)
false negatives (FN)
predictions
true negatives (TN)
Sensitivity
Specificity
30Trade-off between False-Positive Rates and
False-Negative Rates
Arabidopsis thaliana data
31GeneSplicers Accuracy on Different Species
Other species for which a GeneSplicer training
can be fast made available Aspergillus
fumigatus, Schistosoma mansoni, Brugia malayi.
32Splice site prediction problems
- Splice sites do not always conform to consensus
(GC donor sites about 0.5, AT-AC introns
about 0.1).
- Internal exons can be arbitrarily small
(troponin1 has internal exons of 4 nt. in
mammals, and 7 nt. in birds).
- Signals far from splice sites can affect splicing.
33Exon Splicing Enhancers (ESEs)
Splicing requires more motifs then just canonical
splice sites!
Splicing enhancer sequences stimulate splicing
nearby by interactions with splicing factors such
as SR proteins
Splicing enhancers are described by short highly
degenerate motifs (6 to 9 bases in length)
Can compensate for weak splice sites
Some enhancer-independent pre-mRNAs have
multiple redundant enhancers.
34Relative hexamer frequencies in exons vs. introns
Approach 1 Word count statistics
Most frequent 9mers
35Approach 2 motif prediction
Input N DNA sequences x1, ., xN, a motif
length k, and a background model B (a set of
nucleotide probabilities in the sequence) Output
A model M which gives the nucleotide
probabilities in each position in a motif, and
the starting motif positions a1, . aN, in each
of the input sequences x1, ..xN which maximizes
the log-odds likelihood ratio given by
36Gibbs sampling algorithm
- Step 1. Initialization
- Select random locations ai, , aN in the
sequences x1, , xN. - Compute an initial model M from these locations.
- Step 2. Sampling Iterations (Predictive Update)
- Remove a sequence x xi
- Recalculate model M
- Pick a new location of motif in xi according to
probability that the location is a motif
occurrence. - Step 3. Repeat step 2 until convergence.
37Estimated Locations of Pattern Hits
- Gibbs sampling-based motif finder
- Input multi-FASTA file of DNA or amino acid
sequences - Output alignment of sequences around the most
common pattern - Open source free to all, freely redistributable
38ELPH Alignment
TGAAGA
39Motif consensus (12,300 exons)
5 end
3 end
40Multiple ESE Candidates
- do
- Run Gibbs sampler
- Compute motif consensus
- Remove all sequences that match best the motif
- while (no more significant motifs found)
41Approach 3 Use of recently duplicated genes
- 4265 paralogs resulting from large-scale
segmental duplications in Arabidopsis - 9580
exons gt50bp in length that align well between
paralogs (Blast E-value lt 10-5)
42Conserved 6-mers in recent paralogs
Statistically significant conservation at 3rd
codon position
- 5end
- CTGAAG GAAGAA GTGAAG
- GTTGCA TGAAGA TTGAAG
- TTGATG TTGGAG AAGAAA
- AAGAAG AAGCTG AAGGAA
- AAGTTG AGGAAA AGGAAG
- ATACAA ATGAAG ATGCTT
- ATGGAG CTACAA GAAACA
- GATTTT GCAAAG GCTTTT
- GTGAAA GTGGAA GTGGAG
- GTGGCT GTTCAT GTTGCA
- GTGGCT GTTTCA GTTTCT
- TGCTGC TGCTTC TGCTTG
- TGGAAG TGGAGA TGTTGA
- TTCAGC TTCTGC TTCTTG
- TTGCTG TTGGTG TTTCAG
- TTTCTG TTTCTT TTTGAT
- TTTTCT TTTTGA
3end CTTCAA TCAAGA TCTTGA AAAAAG AAAACA
AAAAGA AACAAA AACAAG AAGAAA AAGAAG ATGAAG
CAAGAA CACAAG CAGCTT CTCCAA CTGCAA CTGTCT
CTTGAA CTTGAT CTTTTC GAAGAA GAAGCT GATCCT
GCTCCT TCACCA TCTTGA TGAAGA TGGATG TTCTGA
TTCTTT TTGAAG TTGCTG TTGTTG TTTCAT TTTCTC
TTTCTG TTTGCT TTTGGA TTTGGT TTTTCT
CTGAAG GAAGAA GTGAAG GTTGCA TGAAGA TTGAAG TTGATG
TTGGAG
CTTCAA TCAAGA TCTTGA
43Synonymous mutation ratesESEs vs. background
5end
3end
GAAGAA
TCAAGA
44ESE experiments
- Top-scoring ESEs from computational analysis
being tested for in vitro activity - Experiments at U. Maryland by Steve Mount, Caren
Chang, and colleagues - ESEs also being tested for incorporation into
Arabidopsis gene finders
45Results
46(No Transcript)
47(No Transcript)
48GeneSplicerESE
a)
b)
Sensitivity versus specificity rates for
GeneSplicer and GeneSplicerESE Sensitivity is
defined as the fraction of all true splice sites
found by the splice site predictor specificity
is the fraction of the predicted elements
labelled correctly as splice sites. Rates are
shown for a) donor sites (GS don and GSESE don),
and b) acceptor sites (GS acc and GSESE acc).
Results are obtained using a 5-fold
cross-validation procedure on the ESEAra data
set. Weight matrices for the selected motifs to
describe each of the splice sites were recomputed
on each training data set from the 5 partitions
of the CV procedure.