Sequence analysis of nucleic acids and proteins: part 2


Provided by: cen58

1
Sequence analysis of nucleic acids and proteins
part 2
Prediction of structure and function

Based on Chapter 3 of
Post-genome bioinformatics
by Minoru Kanehisa
Oxford University Press, 2000

2
Search and learning problems in sequence analysis
3
Thermodynamic principle

The amino acid sequence contains all the
information necessary to fold a protein molecule
into its native 3D state under physiological
conditions: proteins fold, denature, and
spontaneously refold. This is called Anfinsen's
thermodynamic principle.
Thus it should be possible to predict 3D
structure computationally by minimizing a
suitable conformational energy function. Such a
function is difficult to define and difficult to
minimize (globally). This approach is called
ab initio prediction.
In practice, structures determined by X-ray
crystallography and nuclear magnetic resonance
(NMR) are used to give empirical
structure-function relationships.

4
A schematic illustration of RNA secondary
structure elements: hairpin loop, stem,
pseudoknot, bulge loop, internal loop, branch loop.
RNA secondary structure can be predicted ab
initio using an energy function and dynamic
programming (DP) to minimize it, in a process
similar to alignment.
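The DP idea on this slide can be sketched with a simplified stand-in for a real energy function. The example below is a Nussinov-style recursion that maximizes the number of nested base pairs (equivalent to giving every pair the same favourable energy). The function name, the allowed pairs (including G-U wobble) and the minimum hairpin loop of 3 bases are assumptions made for illustration; real predictors use measured stacking and loop energies, and this recursion cannot represent pseudoknots.

```python
def max_base_pairs(seq, min_loop=3):
    """Nussinov-style DP: maximum number of nested base pairs in seq.

    Simplified stand-in for energy minimization: every allowed pair
    scores 1, and a hairpin must enclose at least `min_loop` bases.
    """
    pairs = {("A", "U"), ("U", "A"), ("G", "C"),
             ("C", "G"), ("G", "U"), ("U", "G")}
    n = len(seq)
    if n == 0:
        return 0
    # best[i][j] = maximum number of pairs within seq[i..j] (inclusive)
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]            # case 1: base j is unpaired
            for k in range(i, j - min_loop):  # case 2: base j pairs with k
                if (seq[k], seq[j]) in pairs:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1]
```

As in sequence alignment, the table is filled from short subsequences to long ones, and a traceback over the same table (omitted here) would recover the actual stems and loops.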
5
[Figure: secondary structure of yeast alanyl transfer RNA, showing base-paired stems (G.C, C.G, G.U, A.U, U.A) and unpaired loops]
6
The definition of a dihedral angle and the
three backbone dihedral angles, φ, ψ, ω, in a
protein. Because ω is around 180°, the backbone
configuration can be specified by φ and ψ for
each peptide unit.
Prediction of protein secondary structure: many
methods
[Figure: backbone atoms (N, Cα, C, with H, O and side chain R) of consecutive peptide units, with the angles φ, ψ and ω marked]
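As a minimal sketch of the definition, the function below computes the dihedral angle defined by four points in space. The function names are hypothetical, and the sign of the result follows this particular formula rather than any convention stated on the slide. With the usual atom choices, φ would be computed from C(i-1), N(i), Cα(i), C(i); ψ from N(i), Cα(i), C(i), N(i+1); and ω from Cα(i), C(i), N(i+1), Cα(i+1).

```python
import math

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dihedral(p0, p1, p2, p3):
    """Dihedral angle in degrees about the p1-p2 bond, in [-180, 180]."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    length = math.sqrt(_dot(b1, b1))
    b1 = tuple(x / length for x in b1)  # unit vector along the central bond
    # Project the two outer bonds onto the plane perpendicular to b1.
    v = _sub(b0, tuple(_dot(b0, b1) * x for x in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * x for x in b1))
    # atan2 of the two projected components gives a signed angle.
    return math.degrees(math.atan2(_dot(_cross(b1, v), w), _dot(v, w)))
```

A planar trans arrangement such as `dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (-1, 0, 1))` has magnitude 180, which is the situation described for ω.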
7
Prediction of protein secondary structure

The options are α-helix, β-strand and coil.
Many secondary (2°) structure prediction methods
exist; the method of Chou and Fasman and the
method of Garnier, Osguthorpe and Robson (GOR)
are widely used.
These are position- and structure-specific
scoring matrices (PSSMs) based on modest or large
numbers of proteins. On the next page we display
the GOR PSSM for α-helices.
These days one can choose from methods based on
almost every major machine learning approach:
artificial neural networks (ANN), hidden Markov
models (HMM), etc.

8
[Figure: GOR scoring matrix for the α-helix state, with the position axis labelled C-ter and N-ter]
9
Two architectures of the hierarchical neural
network: (a) the perceptron and (b) the
back-propagation neural network.
(a) Input layer, output layer
(b) Input layer, hidden layer, output layer
10
Prediction of transmembrane domains

Membrane proteins are very common, perhaps 25% of
all proteins. Membranes are hydrophobic, so a
transmembrane domain typically consists of
hydrophobic residues, about 20 of them to span
the membrane.
There are a number of rules for detecting them:
Kyte-Doolittle hydropathy scores work fairly
well, and the Klein-Kanehisa-DeLisi
discriminant function does even better.
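A sliding-window hydropathy scan of the kind mentioned above can be sketched as follows. The residue values are the Kyte-Doolittle hydropathy scale; the function names, the window of 19 residues and the cutoff of 1.6 are assumptions chosen for illustration (they are commonly quoted settings, not values given on the slide). The Klein-Kanehisa-DeLisi discriminant function is not shown.

```python
# Kyte-Doolittle hydropathy scale (higher = more hydrophobic).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydropathy_profile(seq, window=19):
    """Mean hydropathy of every full window of `window` residues."""
    scores = [KD[aa] for aa in seq]
    return [sum(scores[i:i + window]) / window
            for i in range(len(seq) - window + 1)]

def candidate_tm_windows(seq, window=19, threshold=1.6):
    """Start indices of windows whose mean hydropathy reaches the cutoff."""
    profile = hydropathy_profile(seq, window)
    return [i for i, h in enumerate(profile) if h >= threshold]
```

Runs of consecutive indices returned by `candidate_tm_windows` would then be merged into candidate membrane-spanning segments.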

11
Three-dimensional structures of two membrane
proteins
Photosynthetic reaction centre (PDB: 1PRC)
Outer membrane protein porin (PDB: 1OMF)
12
Hidden Markov Models (HMMs)

$S$: states $s_0, s_1, \dots, s_n$
$V$: output alphabet $v_0, v_1, \dots, v_m$
$A = \{a_{ij}\}$: transition probability from $s_i$ to $s_j$
$B = \{b_i(j)\}$: probability of outputting $v_j$ in state $s_i$
What is the probability of a sequence of
observations?
What are the maximum likelihood estimates of
parameters in an HMM?
What is the most likely sequence of states that
produced a given sequence of observations?
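The first of these questions is usually answered with the forward algorithm, sketched below. The dictionary-based representation and the explicit initial distribution `start` are assumptions added for illustration; the slide defines only the states, alphabet, transition and output probabilities. The other two questions correspond to Baum-Welch training and Viterbi decoding, which are not shown here.

```python
def forward_probability(obs, states, start, trans, emit):
    """P(obs) under an HMM, summed over all state paths.

    start[s]    : probability of beginning in state s (assumed input)
    trans[s][t] : transition probability a_st
    emit[s][v]  : probability b_s(v) of outputting symbol v in state s
    """
    # alpha[s] = P(observations so far, current state = s)
    alpha = {s: start[s] * emit[s][obs[0]] for s in states}
    for symbol in obs[1:]:
        alpha = {
            t: emit[t][symbol] * sum(alpha[s] * trans[s][t] for s in states)
            for t in states
        }
    return sum(alpha.values())
```

For long sequences the products underflow, so practical implementations rescale `alpha` at each step or work with logarithms.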

13
A hidden Markov model for sequence analysis
m = match state (output), i = insert state (output),
d = delete state (no output)
14
Prediction of protein 3D structures

Knowledge-based prediction of protein 3D (3°)
structure can be classified into two categories:
comparative modelling and fold recognition. The
first can work well when there is significant
sequence similarity to a protein with known 3D
structure. By contrast, fold recognition is used
when no significant sequence similarity exists,
and makes use of the knowledge and analysis of
all protein structures. One such method, due to
Eisenberg and colleagues, involves 3D-1D
alignment. Another is threading.

15
The 3D-1D method for prediction of protein
3D structures involves the construction of a
library of 3D profiles for the known protein
structures.
[Figure: each residue position (residue number 1, 2, 3, ...) is assigned an environmental class (e.g. B1α, B1β, B1, E, P1) from its main chain, its side chain (inside or outside) and whether its surroundings are polar or apolar. A table of 3D-1D scores (environmental class by amino acid, A R ... Y W) is then used to build the 3D profile (residue number by amino acid).]
16
Gene Structure I
DNA - - - - agacgagataaatcgattacagtca - - - -
Transcription
RNA - - - - agacgagauaaaucgauuacaguca - - - -
Splicing
Translation
Protein - - - - - DEI - - - -
Exon Intron Exon Intron Exon
Protein Folding Problem
Protein
17
Gene Structure II
DNA (5' to 3'): Exon 1, Intron 1, Exon 2, Intron 2, Exon 3, Intron 3, Exon 4
TRANSCRIPTION
pre-mRNA
SPLICING
mRNA
TRANSLATION
AUG - X1 ... Xn - STOP
protein sequence
protein 3D structure
18
Gene Structure III
DNA (5' to 3'): Exon 1, Intron 1, Exon 2, Intron 2, Exon 3, Intron 3, Exon 4
Promoter: TATA
Translation initiation: ATG
Splice site: GGTGAG
Branchpoint: CTGAC
Pyrimidine tract
Splice site: CAG
Stop codon: TAG/TGA/TAA
polyA signal
19
Additional Difficulties

Alternative splicing

[Figure: the same pre-mRNA gives Protein I by SPLICING and TRANSLATION, and Protein II by ALTERNATIVE SPLICING and TRANSLATION]

Pseudogenes

[Figure: DNA]
20
Approaches to Gene Recognition

Homology: BLASTN, TBLASTX, Procrustes
Statistical de novo: GRAIL, FGENEH, Genscan, Genie, Glimmer
Hybrid: GenomeScan, Genie

F(,,,)
21
Example: Glimmer, gene finding in microbial DNA

No introns
90% coding
Shorter genomes (less than 10 million bp)
Lots of data

22
Gene Structure in Prokaryotes
ORF (open reading frame)
Translation initiation: ATG
Stop codon: TAG/TGA/TAA
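Because prokaryotic genes have no introns, a first pass at gene finding is an ORF scan between the signals listed on this slide. The sketch below is a minimal illustration, not how Glimmer works: the function name and the `min_codons` cutoff are assumptions, and only the three forward-strand reading frames are scanned (the reverse complement is left out for brevity).

```python
STOP_CODONS = {"TAG", "TGA", "TAA"}

def find_orfs(dna, min_codons=2):
    """(start, end) index pairs of ATG...stop ORFs on the forward strand.

    `end` is exclusive and includes the stop codon; `min_codons` counts
    the codons from ATG up to, but not including, the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first ATG opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None                   # look for the next ORF
    return sorted(orfs)
```

For example, `find_orfs("CCATGAAATAGCC")` returns `[(2, 11)]`, the span of `ATGAAATAG`. Statistical gene finders go further by scoring whether an ORF looks like coding sequence.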
23
Simplest Hidden Markov Gene Model
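[Figure: two-state HMM with a Coding state and an Intergene state]
Coding state output probabilities: A 0.9, C 0.03, G 0.04, T 0.03
Intergene state output probabilities: A 0.25, C 0.25, G 0.25, T 0.25
Transitions: the arcs of each state are labelled 0.9 and 0.1; the ATG and TAA steps between the two states are labelled 1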
24
The Viterbi Algorithm
A A C A G T G A
C T C T
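The Viterbi algorithm finds the most likely state path for a sequence. The sketch below applies it to a reduced version of the two-state gene model from the previous slide, using that slide's output probabilities. Several things are assumptions made for illustration: 0.9 is taken as the probability of staying in a state and 0.1 as the probability of switching, the start distribution is uniform, and the explicit ATG and TAA steps are left out.

```python
import math

STATES = ("coding", "intergene")
EMIT = {
    "coding": {"A": 0.9, "C": 0.03, "G": 0.04, "T": 0.03},
    "intergene": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
}
# Assumed reading of the figure: 0.9 = stay, 0.1 = switch state.
TRANS = {
    "coding": {"coding": 0.9, "intergene": 0.1},
    "intergene": {"intergene": 0.9, "coding": 0.1},
}
START = {"coding": 0.5, "intergene": 0.5}  # assumption, not on the slide

def viterbi(seq):
    """Most likely state path for a non-empty DNA string (log-space)."""
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # back[n][t] = best predecessor of state t at position n + 1
    for symbol in seq[1:]:
        new_score, pointers = {}, {}
        for t in STATES:
            prev = max(STATES, key=lambda s: score[s] + math.log(TRANS[s][t]))
            new_score[t] = (score[prev] + math.log(TRANS[prev][t])
                            + math.log(EMIT[t][symbol]))
            pointers[t] = prev
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]
```

Working with log probabilities turns the products into sums and avoids underflow on genome-length input. The recursion has the same shape as the forward algorithm, with a maximum in place of the sum.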
25
Example: Genscan, gene finding in human DNA

Introns
5% coding
Large genome (3 billion bp)
Alternative splicing

26
The Genscan HMM
27
Examples of functional sites.
28
Protein sorting prediction

The final step in informational expression of
proteins involves their sorting to the
appropriate location within or outside the cell.
The information for correct localization is
usually located within the protein itself.

31
Sequence Alignment Problem

Task: find common patterns shared by multiple
protein sequences.
Importance: understanding function and
structure, revealing evolutionary relationships,
organizing data.
Types: pairwise vs. multiple; global vs. local.
Approaches: criteria-based (extension of
pairwise methods) versus model-based (EM, Gibbs,
HMM).

32
Outline of Liu-Lawrence approach

Local alignment: examples, the Gibbs sampling
algorithm.
A simple multinomial model for block-motifs and
the Bayesian missing-data formulation.
Possible but not covered here:
Motif sampler: repeated motifs.
The hidden Markov model (its decoupling).
The propagation model and beyond.

33
Example: search for regulatory binding sites

Gene transcription and regulation
Transcription is initiated by RNA polymerase
binding at the so-called promoter region
(TATA-box, or the -10 and -35 elements).
It is regulated by (regulatory) proteins that
bind DNA near the promoter region.
These binding sites on DNA are often similar in
composition.

[Figure: DNA with 5' and 3' ends labelled, showing enhancers and repressors, RNA polymerase at the promoter region, and the starting codon (AUG) at the translation start]
35
The particular dataset

18 DNA segments, each 105 bp long.
There is at least one CRP binding site, known
experimentally, in each sequence.
The binding sites are about 16-19 base pairs
long, with considerable variability in their
contents.
We want to see whether we can find these sites
computationally.

36
The Data Set
37
Truth?
38
Example: H-T-H proteins

HTH (helix-turn-helix): sequence-specific DNA
binding, gene regulation.
Motifs occur as local isolated structures. The
whole 3D structures are known and very
different.
30 sequences with known HTH positions were
chosen. The set represents a typically diverse
cross-section of HTH sequences.
The width of the motif pattern is assumed to be
in the range from 17 to 22. The criterion
"information per parameter" (IPP) is used to
determine the optimal width, 21.
A heuristic convergence criterion was developed
(multiple restarts with IPP monitored).
Finding

40
Local Alignment of Multiple Sequences
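[Figure: a set of sequences, sequence k having length $n_k$; each contains a local motif of width $w$, located at $a_1$, $a_2$, ..., $a_k$]
Alignment variable: $A = \{a_1, a_2, \dots, a_k\}$
Objective: find the best common patterns.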
41
Motif Alignment Model
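[Figure: a set of sequences, sequence k having length $n_k$; each contains a motif of width $w$, located at $a_1$, $a_2$, ..., $a_k$]
The missing data: alignment variable $A = \{a_1, a_2, \dots, a_k\}$

Every non-site position follows a common
multinomial with $p_0 = (p_{0,1}, \dots, p_{0,20})$.
Every position $i$ in the motif element
follows the probability
distribution $p_i = (p_{i,1}, \dots, p_{i,20})$.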

42
The Tricky Part: the alignment variable
$A = \{a_1, a_2, \dots, a_k\}$ is not observable

General missing-data problem:
Unobserved data in each datum
Object of the DP optimization (path)
Potentially observable
Examples:
Alignment
RNA structure
Protein secondary structure

43
Statistical Models

How do we describe patterns?
frequencies of amino acid types.
multinomial distribution --- more generally a
model

A typical aligned motif
44
Multinomial Distribution
A total of k sequences
Model Mi for i-th column
$(k_{i,1}, k_{i,2}, \dots, k_{i,20}) \sim \mathrm{Multinom}(k, p_i)$,
where $p_i = (p_{i,1}, \dots, p_{i,20})$
45
Estimation for the pattern

The maximum likelihood
Bayesian estimate
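Prior: $p_i \sim \mathrm{Dirichlet}(\alpha_{i,1}, \dots, \alpha_{i,20})$
(the $\alpha_{i,j}$ act as pseudo-counts)
Posterior: $p_i \mid \text{obs} \sim \mathrm{Dirichlet}(\alpha_{i,1} + k_{i,1}, \dots, \alpha_{i,20} + k_{i,20})$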
Posterior Mean
Posterior Distribution
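The two estimates named on this slide can be compared on a single aligned motif column. The sketch below uses the standard results for a multinomial with a Dirichlet prior: the maximum likelihood estimate is the observed frequency, and the posterior mean adds the pseudo-counts to the observed counts before normalizing. The function name and the symmetric prior (the same pseudo-count for every amino acid) are assumptions made for illustration.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def column_estimates(column, pseudo=1.0):
    """MLE and posterior-mean residue frequencies for one motif column.

    column : the residues observed in this column, one per sequence
    pseudo : Dirichlet pseudo-count, assumed equal for all 20 residues
    """
    counts = Counter(column)
    k = len(column)                            # number of sequences
    total_pseudo = pseudo * len(AMINO_ACIDS)
    mle = {a: counts[a] / k for a in AMINO_ACIDS}
    posterior_mean = {a: (counts[a] + pseudo) / (k + total_pseudo)
                      for a in AMINO_ACIDS}
    return mle, posterior_mean
```

With few sequences the MLE assigns probability zero to every unseen residue, whereas the posterior mean keeps all 20 probabilities positive, which matters when the estimates are later used to score new sequences.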

46
Dealing with the missing data

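Let $\Theta = (p_0, p_1, \dots, p_w)$ be the parameter and
$A = \{a_1, a_2, \dots, a_K\}$ the alignment variable.
Iterative sampling: alternate between
$P(\Theta \mid A, \text{Data})$ and $P(A \mid \Theta, \text{Data})$.
Draw from $[\Theta \mid A, \text{Data}]$, then draw from
$[A \mid \Theta, \text{Data}]$.
Predictive updating: pretend that $K-1$ sequences
have been aligned. We stochastically predict for
the $K$-th sequence.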

47
The Algorithm

Initialize by choosing random starting positions.
Iterate the following steps many times:
1. Randomly or systematically choose a sequence,
say sequence k, to exclude.
2. Carry out the predictive-updating step to update
$a_k$.
Stop when not much change is observed, or when
some criterion is met.

48
The PU-Step
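[Figure: motif positions $a_1$, $a_2$, $a_3$ are fixed; $a_k$ is unknown (?)]
1. Compute the predictive frequencies for each
position $i$ in the motif:
$c_{ij}$ = count of amino acid type $j$ at position $i$;
$c_{0j}$ = count of amino acid type $j$ in all non-site positions;
$q_{ij} = (c_{ij} + b_j)/(K - 1 + B)$, where $B = \sum_j b_j$
is the total of the pseudo-counts $b_j$.
2. Sample from the predictive
distribution of $a_k$.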
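Putting the algorithm and the PU-step together gives a site sampler along the following lines. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name, the fixed number of sweeps in place of a convergence test, the equal pseudo-count for every residue type, the background frequencies estimated with the same pseudo-counts, and the use of the motif-to-background likelihood ratio as the sampling weight are all illustrative choices. It assumes exactly one motif per sequence and makes no attempt to correct phase shifts.

```python
import random

def gibbs_motif_sampler(seqs, w, alphabet, sweeps=200, pseudo=0.5, seed=0):
    """Sample motif start positions a_1..a_K, one site per sequence."""
    rng = random.Random(seed)
    K = len(seqs)
    B = pseudo * len(alphabet)                 # total pseudo-count
    starts = [rng.randrange(len(s) - w + 1) for s in seqs]
    for _ in range(sweeps):
        for k in range(K):                     # exclude sequence k
            counts = [dict.fromkeys(alphabet, 0) for _ in range(w)]
            background = dict.fromkeys(alphabet, 0)
            for n in range(K):
                if n == k:
                    continue
                a = starts[n]
                for pos, ch in enumerate(seqs[n]):
                    if a <= pos < a + w:
                        counts[pos - a][ch] += 1   # c_ij
                    else:
                        background[ch] += 1        # c_0j
            bg_total = sum(background.values())
            # Predictive frequencies q_ij and background frequencies q_0j.
            q = [{ch: (counts[i][ch] + pseudo) / (K - 1 + B)
                  for ch in alphabet} for i in range(w)]
            q0 = {ch: (background[ch] + pseudo) / (bg_total + B)
                  for ch in alphabet}
            # Weight each candidate start by motif vs. background likelihood.
            weights = []
            for a in range(len(seqs[k]) - w + 1):
                ratio = 1.0
                for i in range(w):
                    ch = seqs[k][a + i]
                    ratio *= q[i][ch] / q0[ch]
                weights.append(ratio)
            # Stochastic prediction of a_k (step 2 of the PU-step).
            starts[k] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts
```

For the DNA example one would pass `alphabet="ACGT"`; for proteins, the 20 amino acid letters. Because the sampler can settle on a shifted version of the true motif, it is usually run from several random starts.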
49
Phase-shift and Fragmentation