Sequence analysis of nucleic acids and proteins: part 2

1
Sequence analysis of nucleic acids and proteins
part 2
Prediction of structure and function
  • Based on Chapter 3 of Post-genome Informatics
    by Minoru Kanehisa, Oxford University Press,
    2000

2
Search and learning problems in sequence analysis
3
Thermodynamic principle
  • The amino acid sequence contains all the
    information necessary to fold a protein molecule
    into its native 3D state under physiological
    conditions: proteins fold, denature and
    spontaneously refold. This is called Anfinsen's
    thermodynamic principle.
  • Thus it should be possible to predict 3D
    structure computationally by minimizing a
    suitable conformational energy function (the
    ab initio approach), but such a function is
    difficult to define and difficult to minimize
    globally.
  • In practice, structures determined by X-ray
    crystallography and nuclear magnetic resonance
    (NMR) are used to give empirical
    structure-function relationships.

4
A schematic illustration of RNA secondary
structure elements.
RNA secondary structure can be predicted ab
initio using an energy function and dynamic
programming (DP) to minimize it, in a process
similar to alignment.
[Figure labels: hairpin loop, stem, pseudoknot,
bulge loop, internal loop, branch loop]
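The DP idea on this slide can be sketched with a Nussinov-style recursion. This is a simplification, not the energy model the slide refers to: it maximizes the number of nested base pairs instead of minimizing a thermodynamic energy, it cannot represent pseudoknots, and the function name and the `min_loop` parameter are assumptions made for the illustration.

```python
def nussinov_pairs(seq, min_loop=3):
    """Maximum number of nested base pairs (Nussinov-style DP sketch).

    Assumption: Watson-Crick and G.U wobble pairs are allowed, and a
    hairpin must enclose at least `min_loop` unpaired bases.
    """
    pairs = {("A", "U"), ("U", "A"), ("G", "C"),
             ("C", "G"), ("G", "U"), ("U", "G")}
    n = len(seq)
    # best[i][j] = max number of pairs within seq[i..j]
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]  # case 1: base j stays unpaired
            # case 2: base j pairs with some k, splitting the interval
            for k in range(i, j - min_loop):
                if (seq[k], seq[j]) in pairs:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1] if n else 0
```

As with alignment, the table is filled from short intervals to long ones; a real predictor would replace the "+ 1" with stacking and loop energies.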
5
[Figure: secondary structure diagram of yeast
alanyl transfer RNA, with base pairs such as G.C,
C.G, G.U, A.U and U.A marked]
6
The definition of a dihedral angle and the
three backbone dihedral angles, φ, ψ and ω, in a
protein. Because ω is around 180°, the backbone
configuration can be specified by φ and ψ for
each peptide unit.
Prediction of protein secondary structure: many
methods
[Figure: peptide backbone (N, Cα, C, O, H and R
atoms) with the dihedral angles φ, ψ and ω marked
on one peptide unit]
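To make the definition concrete, here is a minimal sketch that computes the dihedral angle defined by four points. The function name and the tuple-based coordinates are assumptions for the illustration. For a protein backbone the four atoms would be C(i-1), N, Cα, C for φ; N, Cα, C, N(i+1) for ψ; and Cα, C, N(i+1), Cα(i+1) for ω.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral_degrees(p0, p1, p2, p3):
    """Signed dihedral angle (degrees) about the p1-p2 bond."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1 = _cross(b1, b2)  # normal of the plane through p0, p1, p2
    n2 = _cross(b2, b3)  # normal of the plane through p1, p2, p3
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))
```

A planar trans arrangement gives an angle of magnitude 180°, which is the situation the slide describes for ω, and a planar cis arrangement gives 0°.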
7
Prediction of protein secondary structure
  • The options are α-helix, β-strand and coil.
  • Many 2° structure prediction methods exist, with
    one by Chou and Fasman and another due to
    Garnier, Osguthorpe and Robson (GOR) being
    widely used. These are position- and
    structure-specific scoring matrices (PSSMs)
    based on modest or large numbers of proteins. On
    the next page we display the GOR PSSM for
    α-helices.
  • These days one can choose from methods based on
    almost every major machine learning approach:
    ANN, HMM, etc.

8
α-helix state
[Figure: scoring matrix for the α-helix state; axis
labels C-ter and N-ter]
9
Two architectures of the hierarchical neural
network: (a) the perceptron and (b) the
back-propagation neural network.
[Figure: (a) input layer and output layer; (b)
input layer, hidden layer and output layer]
10
Prediction of transmembrane domains
  • Membrane proteins are very common, perhaps 25% of
    all proteins. Membranes are hydrophobic, so a
    transmembrane domain typically consists of
    hydrophobic residues, about 20 of them to span
    the membrane.
  • There are a number of rules for detecting them:
    Kyte-Doolittle hydropathy scores work fairly
    well, and the Klein-Kanehisa-DeLisi
    discriminant function does even better.
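A hydropathy scan of the kind mentioned above can be sketched as a sliding-window average over the Kyte-Doolittle scale. The scale values are the published ones; the window of 19 residues and the cutoff of 1.6 are a commonly quoted rule of thumb and should be treated as assumptions here, as should the function names. This is not the Klein-Kanehisa-DeLisi discriminant function.

```python
# Kyte-Doolittle hydropathy scale (positive = hydrophobic)
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq, window=19):
    """Average hydropathy of every window of `window` residues."""
    scores = [KD[aa] for aa in seq]
    return [sum(scores[i:i + window]) / window
            for i in range(len(scores) - window + 1)]


def candidate_tm_windows(seq, window=19, threshold=1.6):
    """Start indices of windows whose average exceeds the threshold."""
    profile = hydropathy_profile(seq, window)
    return [i for i, avg in enumerate(profile) if avg > threshold]
```

For example, `candidate_tm_windows("D" * 10 + "L" * 20 + "K" * 10)` flags the windows that lie mostly inside the leucine stretch and none at the charged ends.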

11
Three-dimensional structures of two membrane
proteins:
Photosynthetic reaction centre (PDB 1PRC)
Outer membrane protein porin (PDB 1OMF)
12
Hidden Markov Models (HMMs)
  • S: states s_0, s_1, ..., s_n
  • V: output alphabet v_0, v_1, ..., v_m
  • A = {a_{ij}}: transition probability from s_i to
    s_j
  • B = {b_i(j)}: probability of outputting v_j in
    state s_i
  • What is the probability of a sequence of
    observations?
  • What are the maximum likelihood estimates of
    parameters in an HMM?
  • What is the most likely sequence of states that
    produced a given sequence of observations?
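The first of the three questions, the probability of a sequence of observations, is answered by the forward algorithm. The sketch below assumes a non-empty observation sequence and a model given as plain dictionaries; the two-state toy model (`AT_rich`, `GC_rich`) and all its numbers are invented for illustration only.

```python
def forward_probability(obs, states, start_p, trans_p, emit_p):
    """P(obs) under an HMM, summing over all hidden state paths."""
    # alpha[s] = P(observations so far, current state = s)
    alpha = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    for symbol in obs[1:]:
        alpha = {
            s: emit_p[s][symbol]
            * sum(alpha[r] * trans_p[r][s] for r in states)
            for s in states
        }
    return sum(alpha.values())


# Hypothetical toy model, not taken from the slides
TOY_STATES = ("AT_rich", "GC_rich")
TOY_START = {"AT_rich": 0.5, "GC_rich": 0.5}
TOY_TRANS = {
    "AT_rich": {"AT_rich": 0.9, "GC_rich": 0.1},
    "GC_rich": {"AT_rich": 0.2, "GC_rich": 0.8},
}
TOY_EMIT = {
    "AT_rich": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
    "GC_rich": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
}
```

The cost is proportional to the sequence length times the square of the number of states, instead of the exponential cost of enumerating every path. For long sequences one would work with logarithms or rescale `alpha` to avoid underflow.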

13
A hidden Markov model for sequence analysis:
m = match state (output), I = insert state (output),
d = delete state (no output)
14
Prediction of protein 3D structures
  • Knowledge-based prediction of protein 3D or 3°
    structure can be classified into two categories:
    comparative modelling and fold recognition. The
    first can work well when there is significant
    sequence similarity to a protein with known 3D
    structure. By contrast, fold recognition is used
    when no significant sequence similarity exists,
    and makes use of the knowledge and analysis of
    all protein structures. One such method, due to
    Eisenberg and colleagues, involves 3D-1D
    alignment. Another is threading.

15
The 3D-1D method for prediction of protein
3D structures involves the
construction of a library of 3D
profiles for the known protein structures.
[Figure: each residue is assigned an environmental
class (for example B1α, B1β, B1, E, P1) from its
main-chain secondary structure, whether its side
chain is inside or outside, and whether its
surroundings are polar or apolar. A table of
3D-1D scores (environmental class by amino acid
A, R, ..., Y, W) turns the sequence of classes
into a 3D profile (residue number by amino acid).
The numerical entries did not survive in this
transcript.]
16
Gene Structure I
DNA:     ... agacgagataaatcgattacagtca ...
  Transcription
RNA:     ... agacgagauaaaucgauuacaguca ...
  Splicing (Exon, Intron, Exon, Intron, Exon)
  Translation
Protein: ... DEI ...
  Protein Folding Problem
Protein
17
Gene Structure II
DNA (5' to 3'): Exon 1, Intron 1, Exon 2,
Intron 2, Exon 3, Intron 3, Exon 4
TRANSCRIPTION
pre-mRNA
SPLICING
mRNA
TRANSLATION
AUG - X1 ... Xn - STOP
protein sequence
protein 3D structure
18
Gene Structure III
DNA (5' to 3'): Exon 1, Intron 1, Exon 2,
Intron 2, Exon 3, Intron 3, Exon 4
Signals marked on the DNA:
  • Promoter: TATA
  • Translation initiation: ATG
  • Splice site: GGTGAG
  • Branchpoint: CTGAC
  • Pyrimidine tract
  • Splice site: CAG
  • Stop codon: TAG/TGA/TAA
  • polyA signal
19
Additional Difficulties
  • Alternative splicing

[Diagram: one pre-mRNA goes through SPLICING or
ALTERNATIVE SPLICING to give two different mRNAs,
whose TRANSLATION yields Protein I and Protein II]
  • Pseudogenes

DNA
20
Approaches to Gene Recognition
  • Homology: BLASTN, TBLASTX, Procrustes
  • Statistical de novo: GRAIL, FGENEH, Genscan,
    Genie, Glimmer
  • Hybrid: GenomeScan, Genie

21
Example: Glimmer, Gene Finding in Microbial DNA
  • No introns
  • 90% coding
  • Shorter genomes (less than 10 million bp)
  • Lots of data

22
Gene Structure in Prokaryotes
ORF
Translation Initiation ATG
Stop codon TAG/TGA/TAA
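With no introns, a first-pass gene candidate is simply an open reading frame running from ATG to an in-frame stop codon. The sketch below scans the three forward reading frames only; the function name and the `min_codons` cutoff are assumptions for the illustration. It is not how Glimmer itself scores genes, and a real scan would also cover the reverse strand.

```python
STOP_CODONS = {"TAG", "TGA", "TAA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end) slices of ATG...stop ORFs, forward strand.

    `end` is exclusive and includes the stop codon. `min_codons`
    counts the codons before the stop, including the ATG.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos  # first ATG opens the reading frame
            elif start is not None and codon in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3))
                start = None  # look for the next ATG in this frame
    return orfs
```

For `"CCATGAAATTTTAGCC"` the only ORF found is the slice `(2, 14)`, i.e. `ATGAAATTTTAG`.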
23
Simplest Hidden Markov Gene Model
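[Diagram: HMM with Coding, ATG, TAA and Intergene
states]
Coding emissions: A 0.9, C 0.03, G 0.04, T 0.03
Intergene emissions: A 0.25, C 0.25, G 0.25, T 0.25
Transition probabilities shown on the arrows: 0.9,
0.1 and 1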
24
The Viterbi Algorithm
Sequence shown: A A C A G T G A C T C T
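The Viterbi algorithm finds the single most likely state path for a sequence. The sketch below works in log space and uses a reduced two-state model: the emission probabilities are the ones printed on the gene-model slide, but the transition structure (self-loops of 0.9, switches of 0.1, no explicit ATG or TAA states) and the uniform start distribution are assumptions made to keep the example small.

```python
import math

STATES = ("Coding", "Intergene")
START_P = {"Coding": 0.5, "Intergene": 0.5}  # assumed
TRANS_P = {  # assumed simplification of the slide's diagram
    "Coding": {"Coding": 0.9, "Intergene": 0.1},
    "Intergene": {"Coding": 0.1, "Intergene": 0.9},
}
EMIT_P = {
    "Coding": {"A": 0.9, "C": 0.03, "G": 0.04, "T": 0.03},
    "Intergene": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
}


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely hidden state path for a non-empty sequence."""
    score = {s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]])
             for s in states}
    back = []  # back[t][s] = best predecessor of state s at step t+1
    for symbol in obs[1:]:
        prev, score, pointers = score, {}, {}
        for s in states:
            best = max(states,
                       key=lambda r: prev[r] + math.log(trans_p[r][s]))
            pointers[s] = best
            score[s] = (prev[best] + math.log(trans_p[best][s])
                        + math.log(emit_p[s][symbol]))
        back.append(pointers)
    # trace back from the best final state
    path = [max(states, key=lambda s: score[s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]
```

With these numbers, `viterbi("CGT" + "A" * 8 + "GCT", STATES, START_P, TRANS_P, EMIT_P)` labels the run of A as Coding and both flanks as Intergene.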
25
Example: Genscan, Gene Finding in Human DNA
  • Introns
  • 5% coding
  • Large genome (3 billion bp)
  • Alternative splicing

26
The Genscan HMM
27
Examples of functional sites.
28
Protein sorting prediction
  • The final step in informational expression of
    proteins involves their sorting to the
    appropriate location within or outside the cell.
    The information for correct localization is
    usually located within the protein itself.

31
Sequence Alignment Problem
  • Task: find common patterns shared by multiple
    protein sequences
  • Importance: understanding function and
    structure, revealing evolutionary relationships,
    organizing data
  • Types: pairwise vs. multiple; global vs. local
  • Approaches: criteria-based (extension of
    pairwise methods) versus model-based (EM, Gibbs,
    HMM)

32
Outline of Liu-Lawrence approach
  • Local alignment: examples, the Gibbs sampling
    algorithm
  • A simple multinomial model for block-motifs and
    the Bayesian missing-data formulation
  • Possible but not covered here:
  • Motif sampler: repeated motifs
  • The hidden Markov model (its decoupling)
  • The propagation model and beyond

33
Example: search for regulatory binding sites
  • Gene Transcription and Regulation
  • Transcription is initiated by RNA polymerase
    binding at the so-called promoter region
    (TATA-box, or -10 and -35)
  • It is regulated by (regulatory) proteins that
    bind DNA near the promoter region.
  • These binding sites on DNA are often similar in
    composition.

[Figure: a gene drawn 5' to 3', showing enhancers
and repressors, RNA polymerase at the promoter
region, and the start codon AUG at the translation
start]
35
The particular dataset
  • 18 DNA segments, each 105 bp long.
  • There is at least one experimentally known CRP
    binding site in each sequence.
  • The binding sites are about 16-19 base pairs
    long, with considerable variability in their
    contents.
  • Interested in seeing if we can find these sites
    computationally.

36
The Data Set
37
Truth?
38
Example: H-T-H proteins
  • HTH (helix-turn-helix): sequence-specific DNA
    binding, gene regulation.
  • Motifs occur as local isolated structures. The
    whole 3-D structures are known and very
    different.
  • 30 sequences with known HTH positions were
    chosen. The set represents a typically diverse
    cross section of HTH sequences.
  • The width of the motif pattern is assumed to be
    in the range from 17 to 22. The criterion
    "information per parameter" (IPP) is used to
    determine the optimal width, 21.
  • A heuristic convergence check was developed
    (multiple restarts with IPP monitored)
  • Finding

40
Local Alignment of Multiple Sequences
[Figure: sequences k = 1, ..., K of length n_k,
each containing a local motif of width w that
starts at position a_k]
Alignment variable: A = {a_1, a_2, ..., a_K}
Objective: find the best common patterns.
41
Motif Alignment Model
[Figure: sequences k = 1, ..., K of length n_k,
each containing a motif of width w that starts at
position a_k]
The missing data: alignment variable
A = {a_1, a_2, ..., a_K}
  • Every non-site position follows a common
    multinomial with p_0 = (p_{0,1}, ..., p_{0,20})
  • Every position i in the motif element
    follows the probability distribution
    p_i = (p_{i,1}, ..., p_{i,20})

42
The Tricky Part: the alignment variable
A = {a_1, a_2, ..., a_K} is not observable
  • General Missing Data problem
  • Unobserved data in each datum
  • Object of the DP optimization (path)
  • Potentially observable
  • Examples: alignment, RNA structure, protein
    secondary structure

43
Statistical Models
  • How do we describe patterns?
  • frequencies of amino acid types.
  • multinomial distribution --- more generally a
    model

A typical aligned motif
44
Multinomial Distribution
A total of K sequences
Model M_i for the i-th column:
(k_{i,1}, k_{i,2}, ..., k_{i,20}) ~ Multinomial(K, p_i)
where p_i = (p_{i,1}, ..., p_{i,20})
45
Estimation for the pattern
  • The maximum likelihood
  • Bayesian estimate
  • Prior: p_i ~ Dirichlet(α_{i,1}, ..., α_{i,20}),
    where the α_{i,j} act as pseudo-counts
  • Posterior: p_i | obs ~ Dirichlet(α_{i,1} + k_{i,1},
    ..., α_{i,20} + k_{i,20})
  • Posterior Mean
  • Posterior Distribution
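For a multinomial column with a Dirichlet prior, the two point estimates have simple closed forms: the maximum likelihood estimate is count / total, and the posterior mean is (count + pseudo-count) / (total + sum of pseudo-counts). The sketch below computes both for one motif column. The function names are assumptions, and every residue that appears in `counts` is assumed to have an entry in `pseudo_counts`.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def ml_estimate(counts):
    """Maximum likelihood estimate: observed frequencies."""
    total = sum(counts.values())
    return {a: counts.get(a, 0) / total for a in AMINO_ACIDS}


def posterior_mean(counts, pseudo_counts):
    """Posterior mean under a Dirichlet prior with given pseudo-counts."""
    total = sum(counts.values()) + sum(pseudo_counts.values())
    return {a: (counts.get(a, 0) + pseudo_counts[a]) / total
            for a in pseudo_counts}
```

With a column of 6 leucines and 2 valines and a pseudo-count of 0.1 per amino acid, the maximum likelihood estimate gives probability 0 to every unseen residue, while the posterior mean gives each of them 0.01, which is why pseudo-counts matter when only a few sequences are aligned.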

46
Dealing with the missing data
  • Let Θ = (p_0, p_1, ..., p_w) be the parameter
    and A = {a_1, a_2, ..., a_K} the alignment
  • Iterative sampling: P(Θ | A, Data) and
    P(A | Θ, Data)
  • Draw from [Θ | A, Data], then draw from
    [A | Θ, Data]
  • Predictive updating: pretend that K-1 sequences
    have been aligned. We stochastically predict for
    the K-th sequence.

47
The Algorithm
  • Initialize by choosing random starting positions
  • Iterate the following steps many times:
  • Randomly or systematically choose a sequence,
    say sequence k, to exclude.
  • Carry out the predictive-updating step to update
    a_k
  • Stop when not much change is observed, or some
    criterion is met.

48
The PU-Step
[Figure: a_1, a_2, a_3, ... are held fixed while
a_k is predicted]
1. Compute the predictive frequencies for each
   position i in the motif:
   c_{ij} = count of amino acid type j at position i
   c_{0j} = count of amino acid type j in all
   non-site positions
   q_{ij} = (c_{ij} + b_j) / (K - 1 + B), where the
   b_j are pseudo-counts and B is their sum
2. Sample from the predictive distribution of a_k.
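The algorithm and its predictive-updating step can be sketched as follows. Several choices here are assumptions rather than something stated on the slides: a DNA alphabet instead of 20 amino acids, one site per sequence, equal pseudo-counts for every letter, background frequencies smoothed in the same way as the motif columns, sampling weights taken as the ratio of motif to background probability, and a fixed number of sweeps in place of a convergence test.

```python
import random

ALPHABET = "ACGT"  # assumption: DNA; the slides use 20 amino acids


def predictive_update(seqs, starts, k, w, pseudo=0.5, rng=random):
    """Resample starts[k] given the sites in all other sequences."""
    others = [s for s in range(len(seqs)) if s != k]
    B = pseudo * len(ALPHABET)  # sum of the pseudo-counts
    c = [dict.fromkeys(ALPHABET, 0) for _ in range(w)]  # motif columns
    c0 = dict.fromkeys(ALPHABET, 0)                     # non-site counts
    for s in others:
        a = starts[s]
        for pos, ch in enumerate(seqs[s]):
            if a <= pos < a + w:
                c[pos - a][ch] += 1
            else:
                c0[ch] += 1
    # q[i][j] = (c_ij + b_j) / (K - 1 + B), as on the slide
    q = [{j: (c[i][j] + pseudo) / (len(others) + B) for j in ALPHABET}
         for i in range(w)]
    n_bg = sum(c0.values())
    q0 = {j: (c0[j] + pseudo) / (n_bg + B) for j in ALPHABET}
    # weight of each candidate start: motif vs. background ratio
    weights = []
    for a in range(len(seqs[k]) - w + 1):
        ratio = 1.0
        for i in range(w):
            ch = seqs[k][a + i]
            ratio *= q[i][ch] / q0[ch]
        weights.append(ratio)
    starts[k] = rng.choices(range(len(weights)), weights=weights)[0]


def gibbs_site_sampler(seqs, w, sweeps=200, seed=0):
    """Return one sampled start position per sequence."""
    rng = random.Random(seed)
    starts = [rng.randrange(len(s) - w + 1) for s in seqs]
    for _ in range(sweeps):
        for k in range(len(seqs)):  # systematic choice of sequence k
            predictive_update(seqs, starts, k, w, rng=rng)
    return starts
```

Because the start of the left-out sequence is sampled rather than set to the best-scoring position, the chain can leave poor alignments; in practice one would run several restarts and monitor a score such as the information per parameter.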
49
Phase-shift and Fragmentation
  • The sampler sometimes gets stuck in a local
    optimum that is a shifted version of the motif
  • How to escape from this local optimum?
  • Simultaneous move: A → A + δ, where
    A + δ = {a_1 + δ, ..., a_K + δ}
  • Use a Metropolis step: accept the move with
    probability p

Compare entropies between new columns and
left-out ones.
50
Acknowledgements for slides used
  • PDB: protein figures
  • Lior Pachter: gene finding
  • Jun Liu: Gibbs sampler