Title: Sequence analysis of nucleic acids and proteins: part 2
1 Sequence analysis of nucleic acids and proteins, part 2
Prediction of structure and function
- Based on Chapter 3 of Post-genome Bioinformatics by Minoru Kanehisa,
Oxford University Press, 2000
2 Search and learning problems in sequence analysis
3 Thermodynamic principle
- The amino acid sequence contains all the
information necessary to fold a protein molecule
into its native 3D state under physiological
conditions (fold, denature, spontaneously refold).
This is called Anfinsen's thermodynamic principle.
- Thus it should be possible to predict 3D
structure computationally by minimizing a
suitable conformational energy function, but such
a function is difficult to define and difficult to
minimize (globally). This approach is called ab initio.
- In practice, structures determined by X-ray
crystallography and nuclear magnetic resonance
(NMR) are used to give empirical
structure-function relationships.
4 A schematic illustration of RNA secondary
structure elements.
RNA secondary structure can be predicted ab
initio using an energy function and dynamic
programming (DP) to minimize it, in a process
similar to alignment.
(Figure labels: hairpin loop, stem, pseudoknot,
bulge loop, internal loop, branch loop.)
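The DP idea on this slide can be illustrated with a minimal sketch. It is a simplification, not the energy model the slide refers to: instead of minimizing a thermodynamic energy function it maximizes the number of base pairs (the Nussinov recursion), it cannot represent pseudoknots, and the function name `nussinov` and the `min_loop` parameter are choices made here for illustration.

```python
def nussinov(seq, min_loop=3):
    """Return (max number of base pairs, dot-bracket string) for an RNA sequence.

    Assumption: the score is a simple pair count, a stand-in for a real
    energy function. min_loop is the minimum number of unpaired bases
    enclosed by a hairpin.
    """
    pairs = {("A", "U"), ("U", "A"), ("G", "C"),
             ("C", "G"), ("G", "U"), ("U", "G")}
    n = len(seq)
    if n == 0:
        return 0, ""
    # best[i][j] = max number of pairs in seq[i..j]
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i + 1][j]  # case: base i is unpaired
            for k in range(i + min_loop + 1, j + 1):  # case: i pairs with k
                if (seq[i], seq[k]) in pairs:
                    right = best[k + 1][j] if k + 1 <= j else 0
                    score = max(score, 1 + best[i + 1][k - 1] + right)
            best[i][j] = score

    # Traceback to recover one optimal structure.
    structure = ["."] * n
    stack = [(0, n - 1)]
    while stack:
        i, j = stack.pop()
        if j - i <= min_loop:
            continue
        if best[i][j] == best[i + 1][j]:
            stack.append((i + 1, j))
            continue
        for k in range(i + min_loop + 1, j + 1):
            if (seq[i], seq[k]) in pairs:
                right = best[k + 1][j] if k + 1 <= j else 0
                if best[i][j] == 1 + best[i + 1][k - 1] + right:
                    structure[i], structure[k] = "(", ")"
                    stack.append((i + 1, k - 1))
                    stack.append((k + 1, j))
                    break
    return best[0][n - 1], "".join(structure)
```

For example, `nussinov("GGGAAAUCC")` finds a three-pair stem closing a hairpin loop of three bases. The table fill is cubic in sequence length, the same order as the energy-based DP methods.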
5 Yeast alanyl transfer RNA
(Figure: secondary structure diagram of the tRNA, with base-paired
stems shown as G.C, C.G, G.U, A.U and U.A pairs.)
6 The definition of a dihedral angle and the
three backbone dihedral angles, φ, ψ, ω, in a
protein. Because ω is around 180°, the backbone
configuration can be specified by φ and ψ for
each peptide unit.
Prediction of protein secondary structure: many
methods
(Figure: backbone atoms N, Cα, C with attached O, H
and side chain R, the angles φ, ψ, ω, and one
peptide unit marked.)
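As a worked illustration of the dihedral angle definition, the sketch below computes the angle from four atom positions. The helper names are hypothetical. For the backbone, φ would use the atoms C(i-1), N, Cα, C; ψ the atoms N, Cα, C, N(i+1); and ω the atoms Cα, C, N(i+1), Cα(i+1).

```python
import math

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def dihedral(p0, p1, p2, p3):
    """Dihedral angle in degrees about the p1-p2 bond (cis = 0, trans = 180)."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1 = _cross(b1, b2)  # normal of the plane through p0, p1, p2
    n2 = _cross(b2, b3)  # normal of the plane through p1, p2, p3
    y = _dot(b2, _cross(n1, n2))
    x = math.sqrt(_dot(b2, b2)) * _dot(n1, n2)
    return math.degrees(math.atan2(y, x))
```

Using `atan2` rather than `acos` keeps the sign of the angle, which is what distinguishes, for example, φ = -60° from φ = +60°.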
7 Prediction of protein secondary structure
- The options are α-helix, β-strand and coil.
- Many secondary (2°) structure prediction methods exist, with
the one by Chou and Fasman and another due to
Garnier, Osguthorpe and Robson (GOR) being widely used.
These are position/structure-specific scoring
matrices (PSSMs) based on modest or large numbers of
proteins. On the next page we display the GOR
PSSM for α-helices.
- These days one can choose from methods based on
almost every major machine learning approach:
ANN, HMM, etc.
8 α-Helix State
(Figure: GOR scoring matrix for the helix state; axis ends labelled C-ter and N-ter.)
9 Two architectures of the hierarchical neural
network: (a) the perceptron and (b) the
back-propagation neural network.
(Figure labels: (a) input layer, output layer;
(b) input layer, hidden layer, output layer.)
10 Prediction of transmembrane domains
- Membrane proteins are very common, perhaps 25% of
all. Membranes are hydrophobic and so a
transmembrane domain typically has hydrophobic
residues, about 20 to span the membrane.
- There are a number of rules for detecting them.
Kyte-Doolittle hydropathy scores work fairly
well, and the Klein-Kanehisa-DeLisi
discriminant function does even better.
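A hydropathy scan of the kind mentioned above can be sketched as follows. The scale is the published Kyte-Doolittle hydropathy index. The window of 19 residues and the cutoff of 1.6 are conventional choices assumed here, not values given on the slide, and the function names are hypothetical. The Klein-Kanehisa-DeLisi discriminant function is not shown.

```python
# Kyte-Doolittle hydropathy index (positive = hydrophobic).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydropathy_profile(seq, window=19):
    """Mean hydropathy of every window of the given length."""
    scores = [KD[aa] for aa in seq.upper()]
    return [sum(scores[i:i + window]) / window
            for i in range(len(scores) - window + 1)]

def candidate_tm_windows(seq, window=19, threshold=1.6):
    """Start positions of windows whose mean hydropathy reaches the cutoff.

    Assumption: window=19 and threshold=1.6 are conventional settings,
    not parameters taken from the slides.
    """
    return [i for i, h in enumerate(hydropathy_profile(seq, window))
            if h >= threshold]
```

A run of overlapping candidate windows would then be merged into one predicted membrane-spanning segment of roughly 20 residues.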
11 Three-dimensional structures of two membrane
proteins
Photosynthetic reaction centre (PDB 1PRC)
Outer membrane protein porin (PDB 1OMF)
12 Hidden Markov Models (HMMs)
- $S$: states $s_0, s_1, \ldots, s_n$
- $V$: output alphabet $v_0, v_1, \ldots, v_m$
- $A = (a_{ij})$: transition probability from $s_i$ to $s_j$
- $B = (b_i(j))$: probability of outputting $v_j$ in state $s_i$
- What is the probability of a sequence of
observations?
- What are the maximum likelihood estimates of
parameters in an HMM?
- What is the most likely sequence of states that
produced a given sequence of observations?
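The first of the three questions, the probability of a sequence of observations, is usually answered with the forward algorithm. The sketch below is a minimal version under stated assumptions: the dictionary-based representation, the argument names and the explicit start distribution are choices made here, since the slide does not define an initial state distribution.

```python
def forward(obs, states, start, trans, emit):
    """Probability of an observation sequence under an HMM.

    start[s]    : assumed initial probability of state s
    trans[r][s] : transition probability a_rs from state r to state s
    emit[s][x]  : probability b_s(x) of outputting symbol x in state s
    """
    if not obs:
        return 1.0
    # alpha[s] = P(observations so far, current state = s)
    alpha = {s: start[s] * emit[s][obs[0]] for s in states}
    for x in obs[1:]:
        alpha = {s: emit[s][x] * sum(alpha[r] * trans[r][s] for r in states)
                 for s in states}
    return sum(alpha.values())
```

The cost is proportional to the sequence length times the square of the number of states, compared with an exponential number of state paths if they were enumerated directly. For long sequences a practical version would rescale `alpha` or work with logarithms to avoid underflow.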
13 A hidden Markov model for sequence analysis:
m = match state (output), I = insert state (output),
d = delete state (no output)
14 Prediction of protein 3D structures
- Knowledge-based prediction of protein 3D (tertiary)
structure can be classified into two categories:
comparative modelling and fold recognition. The
first can work well when there is significant
sequence similarity to a protein with known 3D
structure. By contrast, fold recognition is used
when no significant sequence similarity exists,
and makes use of the knowledge and analysis of
all protein structures. One such method, due to
Eisenberg and colleagues, involves 3D-1D
alignment. Another is threading.
15 The 3D-1D method for prediction of protein
3D structures involves the construction of a
library of 3D profiles for the known protein
structures.
(Figure: each residue position is assigned an
environmental class, such as B1α, B1β, B1, P1 or E,
from its main-chain state (α, β or other) and from
whether its side chain is inside or outside and in a
polar or apolar environment. A 3D-1D score table
gives a score for each amino acid (A, R, ..., Y, W)
in each environmental class. The 3D profile lists
these scores for each residue number 1, 2, 3, ..., N.)
16 Gene Structure I
DNA - - - - agacgagataaatcgattacagtca - - - -
Transcription
RNA - - - - agacgagauaaaucgauuacaguca - - - -
Splicing (Exon, Intron, Exon, Intron, Exon)
Translation
Protein - - - - - DEI - - - -
Protein Folding Problem
Protein
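The transcription and translation steps on this slide can be sketched in a few lines. This is a minimal illustration that ignores splicing. The function names and the `frame` parameter are assumptions made here, and the standard genetic code is built from the usual TCAG ordering. The slide's DEI appears to correspond to reading the fragment from its second base (frame offset 1).

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order; "*" marks a stop.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO)}

def transcribe(dna):
    """DNA coding strand to RNA: replace T with U."""
    return dna.upper().replace("T", "U")

def translate(rna, frame=0):
    """Translate RNA from the given frame offset until a stop codon or the end."""
    dna = rna.upper().replace("U", "T")
    protein = []
    for i in range(frame, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

rna = transcribe("agacgagataaatcgattacagtca")
peptide = translate(rna, frame=1)
```

Here `rna` matches the RNA line of the slide (in upper case) and `peptide` begins with DEI.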
17 Gene Structure II
DNA (5' to 3'): Exon 1, Intron 1, Exon 2, Intron 2, Exon 3, Intron 3, Exon 4
TRANSCRIPTION
pre-mRNA
SPLICING
mRNA
TRANSLATION
AUG - X1 ... Xn - STOP
protein sequence
protein 3D structure
18 Gene Structure III
DNA (5' to 3'): Exon 1, Intron 1, Exon 2, Intron 2, Exon 3, Intron 3, Exon 4
Promoter: TATA
Translation initiation: ATG
Splice site: GGTGAG
Branchpoint: CTGAC
Pyrimidine tract
Splice site: CAG
Stop codon: TAG/TGA/TAA
polyA signal
19 Additional Difficulties
DNA
pre-mRNA
SPLICING: mRNA, TRANSLATION, Protein I
ALTERNATIVE SPLICING: mRNA, TRANSLATION, Protein II
20 Approaches to Gene Recognition
- Homology
  - BLASTN, TBLASTX
  - Procrustes
- Statistical de novo
  - GRAIL, FGENEH, Genscan, Genie, Glimmer
- Hybrid
  - GenomeScan, Genie
21 Example: Glimmer, Gene Finding in Microbial DNA
- No introns
- 90% coding
- Shorter genomes (less than 10 million bp)
- Lots of data
22 Gene Structure in Prokaryotes
ORF
Translation initiation: ATG
Stop codon: TAG/TGA/TAA
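Because prokaryotic genes have no introns, a first pass at finding candidates is a scan for open reading frames between ATG and an in-frame stop codon. The sketch below is a deliberately naive illustration and not how Glimmer works: it scans the forward strand only, and the function name and `min_codons` parameter are assumptions made here.

```python
STOP_CODONS = {"TAG", "TGA", "TAA"}

def find_orfs(dna, min_codons=1):
    """Return sorted (start, end) pairs of forward-strand ORFs.

    start is the index of the A in ATG; end is one past the stop codon,
    so dna[start:end] is the ORF including its stop codon. min_codons
    counts codons from ATG up to, but excluding, the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first ATG opens the ORF in this frame
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # look for the next ORF in this frame
    return sorted(orfs)
```

In practice a large `min_codons` is needed, since short ORFs arise by chance in random sequence. Statistical gene finders replace this hard threshold with a model of coding versus non-coding composition.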
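23 Simplest Hidden Markov Gene Model
Coding state emissions: A 0.9, C 0.03, G 0.04, T 0.03
Intergene state emissions: A 0.25, C 0.25, G 0.25, T 0.25
(Figure: the Coding and Intergene states are connected
through ATG and TAA states; the arcs carry the
transition probabilities 0.9, 0.1 and 1.)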
24 The Viterbi Algorithm
A A C A G T G A
C T C T
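The Viterbi algorithm finds the most likely state path by dynamic programming. The sketch below applies it to a reduced form of the simplest gene model. It is only a sketch under stated assumptions: the ATG and TAA states are left out, so Coding and Intergene switch directly with probability 0.1 and stay with probability 0.9, and the 0.5/0.5 start distribution is assumed here. The emission probabilities are the ones from the gene model slide.

```python
import math

STATES = ("coding", "intergene")
START = {"coding": 0.5, "intergene": 0.5}          # assumed, not on the slide
TRANS = {"coding": {"coding": 0.9, "intergene": 0.1},
         "intergene": {"coding": 0.1, "intergene": 0.9}}
EMIT = {"coding": {"A": 0.9, "C": 0.03, "G": 0.04, "T": 0.03},
        "intergene": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}}

def viterbi(seq):
    """Most likely state path for a DNA string, computed in log space."""
    seq = seq.upper()
    if not seq:
        return []
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    backpointers = []
    for x in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            # best predecessor state for arriving in s at this position
            prev = max(STATES, key=lambda r: score[r] + math.log(TRANS[r][s]))
            new_score[s] = (score[prev] + math.log(TRANS[prev][s])
                            + math.log(EMIT[s][x]))
            pointers[s] = prev
        backpointers.append(pointers)
        score = new_score
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path
```

With these parameters an A-rich stretch is labelled coding and a mixed stretch intergene, for example `viterbi("AAAAAACGTCGT")` labels the first six positions coding and the last six intergene.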
25 Example: Genscan, Gene Finding in Human DNA
- Introns
- 5% coding
- Large genome (3 billion bp)
- Alternative splicing
26 The Genscan HMM
27 Examples of functional sites.
28 Protein sorting prediction
- The final step in informational expression of
proteins involves their sorting to the
appropriate location within or outside the cell.
The information for correct localization is
usually located within the protein itself.
31 Sequence Alignment Problem
- Task: find common patterns shared by multiple
protein sequences.
- Importance: understanding function and
structures, revealing evolutionary relationships,
organizing data.
- Types: pairwise vs. multiple; global vs. local.
- Approaches: criteria-based (extension of
pairwise methods) versus model-based (EM, Gibbs,
HMM).
32 Outline of Liu-Lawrence approach
- Local alignment: examples, the Gibbs sampling
algorithm.
- A simple multinomial model for block-motifs and
the Bayesian missing-data formulation.
- Possible but not covered here:
  - Motif sampler: repeated motifs.
  - The hidden Markov model (its decoupling).
  - The propagation model and beyond.
33 Example: search for regulatory binding sites
- Gene Transcription and Regulation
- Transcription is initiated by RNA polymerase binding
at the so-called promoter region (TATA-box, or
-10, -35).
- Regulated by some (regulatory) proteins on DNA
near the promoter region.
- These binding sites on DNA are often similar in
composition.
(Figure labels: 5' and 3' ends, enhancers and
repressors, promoter region, RNA polymerase,
starting codon AUG, translation start.)
35 The particular dataset
- 18 DNA segments, each of length 105 bp.
- There is at least one CRP binding site, known
experimentally, in each sequence.
- The binding sites are about 16-19 base pairs
long, with considerable variability in their
contents.
- We are interested in seeing if we can find these sites
computationally.
36 The Data Set
37 Truth?
38 Example: H-T-H proteins
- HTH (helix-turn-helix): sequence-specific DNA binding, gene
regulation.
- Motifs occur as local isolated structures. The
whole 3-D structures are known and very
different.
- 30 sequences with known HTH positions chosen. The
set represents a typically diverse cross section
of HTH sequences.
- Width of the motif pattern is assumed to be in
the range from 17 to 22. The criterion
"information per parameter" (IPP) is used to determine
the optimal width, 21.
- Heuristic convergence criterion developed (multiple
restarts with IPP monitored).
- Finding
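40 Local Alignment of Multiple Sequences
(Figure: $k$ sequences, the $k$-th of length $n_k$, each
containing a local motif of width $w$ starting at
positions $a_1, a_2, \ldots, a_k$.)
Alignment variable: $A = \{a_1, a_2, \ldots, a_k\}$
Objective: find the best common patterns.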
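41 Motif Alignment Model
(Figure: $k$ sequences, the $k$-th of length $n_k$, each
containing a motif of width $w$ starting at
positions $a_1, a_2, \ldots, a_k$.)
The missing data: alignment variable $A = \{a_1, a_2, \ldots, a_k\}$
- Every non-site position follows a common
multinomial distribution with $p_0 = (p_{0,1}, \ldots, p_{0,20})$.
- Every position $i$ in the motif element
follows the probability distribution $p_i = (p_{i,1}, \ldots, p_{i,20})$.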
42 The Tricky Part: the alignment variable
$A = \{a_1, a_2, \ldots, a_k\}$ is not observable
- General Missing Data problem
- Unobserved data in each datum
- Object of the DP optimization (path)
- Potentially observable
- Examples
- Alignment
- RNA structure
- Protein secondary structure
43 Statistical Models
- How do we describe patterns?
- Frequencies of amino acid types.
- Multinomial distribution, or more generally a
model.
A typical aligned motif
44 Multinomial Distribution
A total of $k$ sequences
Model $M_i$ for the $i$-th column:
$(k_{i,1}, k_{i,2}, \ldots, k_{i,20}) \sim \mathrm{Multinom}(k, p_i)$
where $p_i = (p_{i,1}, \ldots, p_{i,20})$
45 Estimation for the pattern
- The maximum likelihood estimate
- Bayesian estimate
- Prior: $p_i \sim \mathrm{Dirichlet}(a_{i,1}, \ldots, a_{i,20})$,
pseudo-counts
- Posterior: $p_i \mid \text{obs} \sim \mathrm{Dirichlet}(a_{i,1} + k_{i,1}, \ldots, a_{i,20} + k_{i,20})$
- Posterior mean
- Posterior distribution
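The two point estimates for one motif column can be sketched as below. For a multinomial column the maximum likelihood estimate of each frequency is its observed count divided by the column total, and under the Dirichlet prior the posterior mean adds the pseudo-count to each count before normalizing. The function names are hypothetical, and the usage example assumes a four-letter alphabet only to keep it short.

```python
def mle(counts):
    """Maximum likelihood estimate: observed count / column total."""
    total = sum(counts.values())
    return {j: c / total for j, c in counts.items()}

def posterior_mean(counts, pseudo):
    """Posterior mean under a Dirichlet prior with the given pseudo-counts.

    Each residue type j gets (pseudo[j] + counts[j]) divided by the sum of
    all pseudo-counts plus the column total.
    """
    total = sum(pseudo.values()) + sum(counts.get(j, 0) for j in pseudo)
    return {j: (pseudo[j] + counts.get(j, 0)) / total for j in pseudo}

# Hypothetical column: 3 A's and 1 C observed, pseudo-count 1 per letter.
column_counts = {"A": 3, "C": 1}
prior = {"A": 1, "C": 1, "G": 1, "T": 1}
estimate = posterior_mean(column_counts, prior)
```

Unlike the maximum likelihood estimate, the posterior mean never assigns probability zero to a residue type that happens to be absent from a small alignment.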
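46 Dealing with the missing data
- Let $\Theta = (p_0, p_1, \ldots, p_w)$ be the parameter and
$A = \{a_1, a_2, \ldots, a_K\}$ the alignment.
- Iterative sampling: $P(\Theta \mid A, \text{Data})$ and
$P(A \mid \Theta, \text{Data})$.
- Draw from $[\Theta \mid A, \text{Data}]$, then draw from
$[A \mid \Theta, \text{Data}]$.
- Predictive updating: pretend that $K-1$ sequences
have been aligned. We stochastically predict for
the $K$-th sequence!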
47 The Algorithm
- Initialize by choosing random starting positions.
- Iterate the following steps many times:
  - Randomly or systematically choose a sequence,
  say sequence $k$, to exclude.
  - Carry out the predictive-updating step to update
  $a_k$.
- Stop when not much change is observed, or some
criterion is met.
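48 The PU-Step
(Figure: sequences with aligned positions $a_1$, $a_2$, $a_3$, and $a_k$ = ?)
1. Compute the predictive frequencies of each
position $i$ in the motif:
$c_{ij}$ = count of amino acid type $j$ at position $i$.
$c_{0j}$ = count of amino acid type $j$ in all non-site positions.
$q_{ij} = (c_{ij} + b_j)/(K - 1 + B)$, where $B = \sum_j b_j$
is the sum of the pseudo-counts $b_j$.
2. Sample from the predictive
distribution of $a_k$.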
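The PU-step and the iteration from the previous slide can be sketched as follows. Several details are assumptions made here rather than taken from the slides: one motif occurrence per sequence, a uniform pseudo-count `pseudo` for every residue type, sequences restricted to the 20 standard amino acid letters, a fixed number of sweeps instead of a convergence criterion, and sampling weights equal to the product over motif columns of the ratio of motif to background predictive frequency, as in the usual Gibbs site sampler. All function names are hypothetical.

```python
import random
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def predictive_frequencies(seqs, starts, w, k, pseudo=0.5):
    """Motif frequencies q[i][j] and background q0[j], leaving out sequence k."""
    others = [(s, a) for n, (s, a) in enumerate(zip(seqs, starts)) if n != k]
    B = pseudo * len(ALPHABET)  # total pseudo-count
    q = []
    for i in range(w):
        counts = Counter(s[a + i] for s, a in others)
        # (c_ij + b_j) / (K - 1 + B); len(others) is K - 1
        q.append({j: (counts[j] + pseudo) / (len(others) + B) for j in ALPHABET})
    background = Counter()
    for s, a in others:
        background.update(s[:a] + s[a + w:])  # all non-site positions
    total = sum(background.values())
    q0 = {j: (background[j] + pseudo) / (total + B) for j in ALPHABET}
    return q, q0

def pu_step(seqs, starts, w, k, rng):
    """Sample a new start position a_k for sequence k."""
    q, q0 = predictive_frequencies(seqs, starts, w, k)
    seq = seqs[k]
    weights = []
    for a in range(len(seq) - w + 1):
        weight = 1.0
        for i in range(w):
            weight *= q[i][seq[a + i]] / q0[seq[a + i]]
        weights.append(weight)
    return rng.choices(range(len(weights)), weights=weights)[0]

def gibbs_motif_sampler(seqs, w, sweeps=100, seed=0):
    """Random initial positions, then repeated systematic PU-steps."""
    rng = random.Random(seed)
    starts = [rng.randrange(len(s) - w + 1) for s in seqs]
    for _ in range(sweeps):
        for k in range(len(seqs)):
            starts[k] = pu_step(seqs, starts, w, k, rng)
    return starts
```

Because each position is sampled rather than chosen greedily, the sampler can leave poor alignments that a deterministic update would keep. Multiple restarts with different seeds are a common safeguard.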
49 Phase-shift and Fragmentation
- Sometimes the sampler gets stuck in a local shift optimum.
- How to escape from this local optimum?
- Simultaneous move: $A \to A + \delta$, where
$A + \delta = \{a_1 + \delta, \ldots, a_K + \delta\}$
- Use a Metropolis step: accept the move with
probability $p$,
Compare entropies between new columns and
left-out ones.
50 Acknowledgements for slides used
- PDB: protein figures
- Lior Pachter: gene finding
- Jun Liu: Gibbs sampler