Computational Molecular Biology - PowerPoint PPT Presentation

1 / 168
About This Presentation
Title:

Computational Molecular Biology

Description:

Computational Molecular Biology Protein Structure: Introduction and Prediction – PowerPoint PPT presentation

Number of Views:182
Avg rating:3.0/5.0
Slides: 169
Provided by: MYTRATHAI
Category:

less

Transcript and Presenter's Notes

Title: Computational Molecular Biology


1
Computational Molecular Biology
  • Protein Structure Introduction and Prediction

2
Protein Folding
  • One of the most important problem in molecular
    biology
  • Given the one-dimensional amino-acid sequence
    that specifies the protein, what is the proteins
    fold in three dimensions?

3
Overview
  • Understand protein structures
  • Primary, secondary, tertiary
  • Why study protein folding
  • Structure can reveal functional information
    which we cannot find from the sequence
  • Misfolding proteins can cause diseases mad cow
    disease
  • Use in drug designs

4
Overview of Protein Structure
  • Proteins make up about 50 of the mass of the
    average human
  • Play a vital role in keeping our bodies
    functioning properly
  • Biopolymers made up of amino acids
  • The order of the amino acids in a protein and the
    properties of their side chains determine the
    three dimensional structure and function of the
    protein

5
Amino Acid
  • Building blocks of proteins
  • Consist of
  • An amino group (-NH2)
  • Carboxyl group (-COOH)
  • Hydrogen (-H)
  • A side chain group (-R) attached to the central
    a-carbon
  • There are 20 amino acids
  • Primary protein structure is a sequence of a
    chain of amino acids

Side chain
Aminogroup
Carboxylgroup
6
Side chains (Amino Acids)
  • 20 amino acids have side chains that vary in
    structure, size, hydrogen bonding ability, and
    charge.
  • R gives the amino acid its identity
  • R can be simple as hydrogen (glycine) or more
    complex such as an aromatic ring (tryptophan)

7
Chemical Structure of Amino Acids
8
How Amino Acids Become Proteins
Peptide bonds
9
Polypeptide
  • More than fifty amino acids in a chain are called
    a polypeptide.
  • A protein is usually composed of 50 to 400 amino
    acids.
  • We call the units of a protein amino acid
    residues.

amidenitrogen
carbonylcarbon
10
Side chain properties
  • Carbon does not make hydrogen bonds with water
    easily hydrophobic.
  • These water fearing side chains tend to
    sequester themselves in the interior of the
    protein
  • O and N are generally more likely than C to
    h-bond to water hydrophilic
  • Ten to turn outward to the exterior of the
    protein

11
(No Transcript)
12
Primary Structure
Primary structure Linear String of Amino Acids
Side-chain
Backbone
... ALA PHE LEU ILE LEU ARG ...
Each amino acid within a protein is referred to
as residues Each different protein has a unique
sequence of amino acid residues, this is its
primary structure
13
Secondary Structure
  • Refers to the spatial arrangement of contiguous
    amino acid residues
  • Regularly repeating local structures stabilized
    by hydrogen bonds
  • A hydrogen atom attached to a relatively
    electronegative atom
  • Examples of secondary structure are the ahelix
    and ßpleated-sheet

14
Alpha-Helix
  • Amino acids adopt the form of a right handed
    spiral
  • The polypeptide backbone forms the inner part of
    the spiral
  • The side chains project outward
  • every backbone N-H group donates a hydrogen bond
    to the backbone C  O group

15
Beta-Pleated-Sheet
  • Consists of long polypeptide chains called
    beta-strands, aligned adjacent to each other in
    parallel or anti-parallel orientation
  • Hydrogen bonding between the strands keeps them
    together, forming the sheet
  • Hydrogen bonding occurs between amino and
    carboxyl groups of different strands

16
Parallel Beta Sheets
17
Anti-Parallel Beta Sheets
18
Mixed Beta Sheets
19
Tertiary Structure
  • The full dimensional structure, describing the
    overall shape of a protein
  • Also known as its fold

20
Quaternary Structure
  • Proteins are made up of multiple polypeptide
    chains, each called a subunit
  • The spatial arrangement of these subunits is
    referred to as the quaternary structure
  • Sometimes distinct proteins must combine together
    in order to form the correct 3-dimensional
    structure for a particular protein to function
    properly.
  • Example the protein hemoglobin, which carries
    oxygen in blood. Hemoglobin is made of four
    similar proteins that combine to form its
    quaternary structure.

21
Other Units of Structure
  • Motifs (super-secondary structure)
  • Frequently occurring combinations of secondary
    structure units
  • A pattern of alpha-helices and beta-strands
  • Domains A protein chain often consists of
    different regions, or domains
  • Domains within a protein often perform different
    functions
  • Can have completely different structures and
    folds
  • Typically a 100 to 400 residues long

22
What Determines Structure
  • What causes a protein to fold in a particular
    way?
  • At a fundamental level, chemical interactions
    between all the amino acids in the sequence
    contribute to a proteins final conformation
  • There are four fundamental chemical forces
  • Hydrogen bonds
  • Hydrophobic effect
  • Van der Waal Forces
  • Electrostatic forces

23
Hydrogen Bonds
  • Occurs when a pair of nucliophilic atoms such as
    oxygen and nitrogen share a hydrogen between them
  • Pattern of hydrogen bounding is essential in
    stabilizing basic secondary structures

24
Van der Waal Forces
  • Interactions between immediately adjacent atoms
  • Result from the attraction between an atoms
    nucleus and it neighbors electrons

25
Electrostatic Forces
  • Oppositely charged side chains con form
    salt-bridges, which pulls chains together

26
Experimental Determination
  • Centralized database (to deposit protein
    structures) called the protein Databank (PDB),
    accessible at http//www.rcsb.org/pdb/index.html
  • Two main techniques are used to determine/verify
    the structure of a given protein
  • X-ray crystallography
  • Nuclear Magnetic Resonance (NMR)
  • Both are slow, labor intensive, expensive
    (sometimes longer than a year!)

27
X-ray Crystallography
  • A technique that can reveal the precise three
    dimensional positions of most of the atoms in a
    protein molecule
  • The protein is first isolated to yield a high
    concentration solution of the protein
  • This solution is then used to grow crystals
  • The resulting crystal is then exposed to an X-ray
    beam

28
Disadvantages
  • Not all proteins can be crystallized
  • Crystalline structure of a protein may be
    different from its structure
  • Multiple maps may be needed to get a consensus

29
NMR
  • The spinning of certain atomic nuclei generates a
    magnetic moment
  • NMR measures the energy levels of such magnetic
    nuclei (radio frequency)
  • These levels are sensitive to the environment of
    the atom
  • What they are bonded to, which atoms they are
    close to spatially, what distances are between
    different atoms
  • Thus by carefully measurement, the structure of
    the protein can be constructed

30
Disadvantages
  • Constraint of the size of the protein an upper
    bound is 200 residues
  • Protein structure is very sensitive to pH.

31
Computational Methods
  • Given a long and painful experimental methods,
    need computational approaches to predict the
    structure from its sequence.

32
Functional Region Prediction
33
Protein Secondary Structure
34
Tertiary Structure Prediction
35
More Details on X-ray Crystallography
36
Overview
37
Overview
38
Crystal
  • A crystal can be defined as an arrangement of
    building blocks which is periodic in three
    dimensions

39
Crystallize a Protein
  • Have to find the right combination of all the
    different influences to get the protein to
    crystallize
  • This can take a couple hundred or even thousand
    experiments
  • Most popular way to conduct these experiments
  • Hanging-drop method

40
Hanging drop method
  • The reservoir contains a precipitant
    concentration twice as high as the protein
    solution
  • The protein solutions is made up of 50 of stock
    protein solution and 50 of reservoir solution
  • Overtime, water will diffuse from the protein
    drop into the reservoir
  • Both the protein concentration and precipitant
    concentration will increase
  • Crystals will appear after days, weeks, months

41
Properties of protein crystal
  • Very soft
  • Mechanically fragile
  • Large solvent areas (30-70)

42
A Schematic Diffraction Experiment
43
Why do we need Crystals
  • A single molecule could never be oriented and
    handled properly for a diffraction experiment
  • In a crystal, we have about 1015 molecules in the
    same orientation so that we get a tremendous
    amplification of the diffraction
  • Crystals produce much simpler diffraction
    patterns than single molecules

44
Why do we need X-rays
  • X-rays are electromagnetic waves with a
    wavelength close to the distance of atoms in the
    protein molecules
  • To get information about where the atoms are, we
    need to resolve them -gt thus we need radiation

45
A Diffraction Pattern
46
(No Transcript)
47
Resolution
  • The primary measure of crystal order/quality of
    the model
  • Ranges of resolution
  • Low resolution (gt3-5 Ao) is difficult to see the
    side chains only the overall structural fold
  • Medium resolution (2.5-3 Ao)
  • High resolution (2.0 Ao)

48
Some Crystallographic Terms
  • h,k,l Miller indices (like a name of the
    reflection)
  • I(h,k,l) intensity
  • 2? angle between the x-ray incident beam and
    reflect beam

49
Diffraction by a Molecule in a Crystal
  • The electric vector of the X-ray wave forces the
    electrons in our sample to oscillate with the
    same wavelength as the incoming wave

50
Description of Waves
51
Structure Factor Equation
  • fj proportional to the number of electrons this
    atom j has
  • One of the fundamental equations in X-ray
    Crystallography

52
The Phase
  • From the measurement, we can only obtain the
    intensity I(hkl) of any given reflection (hkl)
  • The phase a(hkl) cannot be measured

53
How to Determine the Phase
  • Small changes are introduced into the crystal of
    the protein of interest
  • Eg soaking the crystal in a solution containing
    a heavy atom compound
  • Second diffraction data set needs to be
    collected
  • Comparing two data sets to determine the phases
    (also able to localize the heavy atoms)

54
Other Phase Determination Methods
55
Electron Density Map
  • Once we know the complete diffraction pattern
    (amplitudes and phases), need to calculate an
    image of the structure
  • The above equation returns the electron density
    (so we get a map of where the electrons are their
    concentration)

56
Interpretation of Electron Density
  • Now, the electron density has to be interpreted
    in terms of atom identities and positions.
  • (1) packing of the whole molecules is shown in
    the crystal
  • (2) a chain of seven amino acids in shown with
    the resulting structure superimposed
  • (3) the electron density of a trypophan side
    chain is shown

57
Refinement and the R-Factor
58
Nuclear Magnetic Resonance
  • Concentrated protein solution (very purified)
  • Magnetic field
  • Effect of radio frequencies on the resonance of
    different atoms is measured.

59
(No Transcript)
60
NMR
  • Behavior of any atom is influenced by neighboring
    atoms
  • more closely spaced residues are more perturbed
    than distant residues
  • can calculate distances based on perturbation

61
NMR spectrum of a protein
62
Computational Molecular Biology
  • Protein Structure Secondary Prediction

63
Primary Structure Symbolic Definition
  • A A,C,D,E,F,G,H,I,J,K,L,M,N,P,Q,R,S.T,V,W,Y
    set of symbols denoting all amino acids
  • A - set of all finite sequences formed out of
    elements of A, called protein sequences
  • Elements of A are denoted by x, y, z ..i.e. we
    write x? A, y? A, z?A, etc
  • PROTEIN PRIMARY STRUCTURE any x ? A is also
    called a protein sequence or protein sub-unit

64
Protein Secondary Structure (PSS)
  • Secondary structure the arrangement of the
    peptide backbone in space. It is produced by
    hydrogen bondings between amino acids
  • PROTEIN SECONDARY STRUCTURE consists of protein
    sequence and its hydrogen bonding patterns
    called SS categories

65
Protein Secondary Structure
  • Databases for protein sequences are expanding
    rapidly
  • The number of determined protein structures (PSS
    protein secondary structures) and the number of
    known protein sequences is still limited
  • PSSP (Protein Secondary Structure Prediction)
    research is trying to breach this gap.

66
Protein Secondary Structure
  • The most commonly observed conformations in
    secondary structure are
  • Alpha Helix
  • Beta Sheets/Strands
  • Loops/Turns

67
Turns and Loops
  • Secondary structure elements are connected by
    regions of turns and loops
  • Turns short regions of non-?, non-?
    conformation
  • Loops larger stretches with no secondary
    structure.

68
Three secondary structure states
  • Prediction methods are normally assessed for 3
    states
  • H (helix)
  • E (strands)
  • L (others (loop or turn))

69
Secondary Structure
  • 8 different categories
  • H ? - helix
  • G 310 helix
  • I ? - helix (extremely rare)
  • E ? - strand
  • B ? - bridge
  • T ?- turn
  • S bend
  • L the rest

70
Three SS states Reduction methods
  • Method 1, used by DSSP program
  • H(helix) G (310 helix), H (?- helix)
  • E (strands) E (?-strand), B (?-bridge) ,
  • L all the rest
  • Shortly E,B gt E G,H gt H Rest gt C
  • Method 2, used by STRIDE program
  • H as in Method 1
  • E E (?-strand), b (isolated ? -bridge),
  • L all the rest

71
Three SS states Reduction methods
  • Method 3, used by DEFINE program
  • H(helix) as in Method 1
  • E (strands) E (?-strand),
  • L all the rest

72
Example of typical PSS Data
  • Example
  • Sequence
  • KELVLALYDYQEKSPREVTHKKGDILTLLNSTNKDWWKYEYNDRQGFVP
  • Observed SS
  • HHHHHLLLLEEEHHHLLLEEEEEELLLHHHHHHHHLLLEEEEEELLLHHH

73
PSS Symbolic Definition
  • Given A A,C,D,E,F,G,H,I,J,K,L,M,N,P,Q,R,S.T,V,
    W,Y set of symbols denoting amino acids and a
    protein sequence x ? A
  • Let S H, E, L be the set of symbols of 3
    states H (helix), E (strands) and L (loop) and
    S be the set of all finite sequences of elements
    of S.
  • We denote elements of S by e, e? S

74
PSS Symbolic Definition
  • Any one-to-one function
  • f A? S i.e. f ? A x S
  • is called a protein secondary structure (PSS)
    identification function
  • An element (x, e) ? f is a called protein
    secondary structure (of the protein sequence x)
  • The element e ? S (of (x, e) ? f ) is called
    secondary structure.

75
PSSP
  • If a protein sequence shows clear similarity to a
    protein of known three dimensional structure
  • then the most accurate method of predicting the
    secondary structure is to align the sequences by
    standard dynamic programming algorithms
  • Why?
  • homology modelling is much more accurate than
    secondary structure prediction for high levels of
    sequence identity.

76
PSSP
  • Secondary structure prediction methods are of
    most use when sequence similarity to a protein of
    known structure is undetectable.
  • It is important that there is no detectable
    sequence similarity between sequences used to
    train and test secondary structure prediction
    methods.

77
Classification and Classifiers
  • Given a database table DB with a special
    atribute C, called a class attribute (or decision
    attribute). The values C1, C2, ...Cn of the
    class atrribute are called class labels.
  • Example

A1 A2 A3 A4 C
1 1 m g c1
0 1 v g c2
1 0 m b c1
78
Classification and Classifiers
  • The attribute C partitions the records in the DB
  • divides the records into disjoint subsets
    defined by the attributes C values, CLASSIFIES
    the records.
  • It means we use the attributre C and its values
    to divide the set R of records of DB into n
    disjoint classes
  • C1 r?DB Cc1 ...... Cnr?DB Ccn
  • Example (from our table)
  • C1 (1,1,m,g), (1,0,m,b) r1,r3
  • C2 (0,1,v,g) r2

79
Classification and Classifiers
  • An algorithm is called a classification algorithm
    if it uses the data and its classification to
    build a set of patterns.
  • Those patterns are structured in such a way that
    we can use them to classify unknown sets of
    objects- unknown records.
  • For that reason (because of the goal) the
    classification algorithm is often called shortly
    a classifier.
  • The name classifier implies more then just
    classification algorithm. A classifier is final
    product of a data set and a classification
    algorithm.

80
Classification and Classifiers
  • Building a classifier consists of two phases
  • training and testing.
  • In both phases we use data (training data set
    and disjoint with it test data set) for which the
    class labels are known for ALL of the records.
  • We use the training data set to create patterns
  • We evaluate created patterns with the use of of
    test data, which classification is known.
  • The measure for a trained classifier accuracy is
    called predictive accuracy.
  • The classifier is build i.e. we terminate the
    process if it has been trained and tested and
    predictive accuracy was on an acceptable level.

81
Classifiers Predictive Accuracy
  • PREDICTIVE ACCURACY of a classifier is a
    percentage of well classified data in the testing
    data set.
  • Predictive accuracy depends heavily on a choice
    of the test and training data.
  • There are many methods of choosing test and and
    training sets and hence evaluating the predictive
    accuracy. This is a separate field of research.

82
Accuracy Evaluation
  • Use training data to adjust parameters of method
    until it gives the best agreement between its
    predictions and the known classes
  • Use the testing data to evaluate how well the
    method works (without adjusting parameters!)
  • How do we report the performance?
  • Average accuracy fraction of all test examples
    that were classified correctly

83
Accuracy Evaluation
  • Multiple cross-validation test has to be
    performed to exclude a potential dependency of
    the evaluated accuracy on the particular test set
    chosen
  • Jack-Knife
  • Use 129 chains for setting up the tool (training
    set)
  • 1 for estimating the performance (testing)
  • This has to be repeated 130 times until each
    protein has been used once for testing
  • The average over all 130 tests gives an estimate
    of the prediction accuracy

84
PSSP Datasets
  • Historic RS126 dataset. Contains126 sub-units
    with known secondary structure selected by Rost
    and Sander. Today is not used anymore
  • CB513 dataset. Contains 513 sub-units with known
    secondary structure selected by Cuff and Barton
    in 1999. Used quite frencently in PSSP research
  • HS17771 dataset. Created by Hobohm and Scharf.
    In March-2002 it contained 1771 sub-units
  • Lots of authors has their own and secret
    datasets

85
Measures for PSSP accuracy
  • http//cubic.bioc.columbia.edu/eva/doc/measure_sec
    .html (for more information)
  • Q3 Three-state prediction accuracy (percent of
    succesful classified)
  • Qi obs How many of the observed residues were
    correctly predicted?
  • Qi prd How many of the predicted residues were
    correctly predicted?

86
Measures for PSSP Accuracy
  • Aij number of residues predicted to be in
    structure type j and observed to be in type i
  • Number of residues predicted to be in structure
    i
  • Number of residues observed to be in structure i

87
Measures for SSP Accuracy
  • The percentage of residues correctly predicted to
    be in class i relative to those observed to be in
    class i
  • The percentages of residues correctly predicted
    to be in class i from all residues predicted to
    be in i
  • Overall 3-state accuracy

88
PSSP Algorithms
  • There are three generations in PSSP algorithms
  • First Generation based on statistical
    information of single amino acids (1960s and
    1970s)
  • Second Generation based on windows (segments) of
    amino acids. Typically a window containes 11-21
    amino acids (dominating the filed until early
    1990s)
  • Third Generation based on the use of windows on
    evolutionary information

89
PSSP First Generation
  • First generation PSSP systems are based on
    statistical information on a single amino acid
  • The most relevant algorithms
  • Chow-Fasman, 1974
  • GOR, 1978
  • Both algorithms claimed 74-78 of predictive
    accuracy, but tested with better constructed
    datasets were proved to have the predictive
    accuracy 50 (Nishikawa, 1983)

90
Chou-Fasman method
  • Uses table of conformational parameters
    determined primarily from measurements of the
    known structure (from experimental methods)
  • Table consists of one likelihood for each
    structure for each amino acid
  • Based on frequencies of residues in a-helices,
    b-sheets and turns
  • Notation P(H) propensity to form alpha helices
  • f(i) probability of being in position 1 (of a
    turn)

91
Chou-Fasman Pij-values
92
Chou-Fasman
  • A prediction is made for each type of structure
    for each amino acid
  • Can result in ambiguity if a region has high
    propensities for both helix and sheet (higher
    value usually chosen)

93
Chou-Fasman
  • How it works
  • 1. Assign all of the residues the appropriate set
    of parameters
  • 2. Identify a-helix and b-sheet regions. Extend
    the regions in both directions.
  • 3. If structures overlap compare average values
    for P(H) and P(E) and assign secondary structure
    based on best scores.
  • 4. Turns are calculated using 2 different
    probability values.

94
Assign Pij values
1. Assign all of the residues the appropriate
set of parameters
95
Scan peptide for a-helix regions
2. Identify regions where 4 out of 6 have a
P(H) gt100 alpha-helix nucleus
96
Extend a-helix nucleus
3. Extend helix in both directions until a set of
four consecutive residues with P(H) lt100.
Find sum of P(H) and sum of P(E) in the extended
region If region is long enough ( gt 5 letters)
and sum P(H) gt sum P(E) then declare the extended
region as alpha helix
97
Scan peptide for b-sheet regions
4. Identify regions where 3 out of 5 have a
P(E) gt100 b-sheet nucleus 5. Extend b-sheet
until 4 continuous residues with an average P(E)
lt 100 6. If region average gt 100 and the
average P(E) gt average P(H) then b-sheet
98
Overlapping
  • Resolving overlapping alpha helix beta sheet
  • Compute sum of P(H) and sum of P(E) in the
    overlap.
  • If sum P(H) gt sum P(E) gt alpha helix
  • If sum P(E) gt sum P(H) gt beta sheet

99
Turn Prediction
  • An amino acid is predicted as turn if all of the
    following holds
  • f(i)f(i1)f(i2)f(i3) gt 0.000075
  • Avg(P(ik)) gt 100, for k0, 1, 2, 3
  • Sum(P(t)) gt Sum(P(H)) and Sum(P(E)) for ik,
    (k0, 1, 2, 3)

100
PSSP Second Generation
  • Based on the information contained in a window of
    amino acids (11-21 aa.)
  • The most systems use algorithms based on
  • Statistical information
  • Physico-chemical properties
  • Sequence patterns
  • Graph-theory
  • Multivariante statistics
  • Expert rules
  • Nearest-neighbour algorithms

101
PSSP First Second Generation
  • Main problems
  • Prediction accuracy lt70
  • SS assigments differ even between crystals of the
    same protein
  • SS formation is partially determined by
    long-range interactions, i.e., by contacts
    between residues that are not visible by any
    method based on windows of 11-21 adjacent residues

102
PSSP First Second Generation
  • Main problems
  • Prediction accuracy for b-strand 28-48, only
    slightly better than random
  • beta-sheet formation is determined by more
    nonlocal contacts than in alpha-helix formation
  • Predicted helices and strands are usually too
    short
  • Overlooked by most developers

103
Example of Second Generation
  • Example for typical secondary structure
    prediction of the 2nd generation.
  • The protein sequence (SEQ ) given was the SH3
    structure.
  • The observed secondary structure (OBS ) was
    assigned by DSSP (H helix E strand blank
    non-regular structure the dashes indicate the
    continuation).
  • The typical prediction of too short segments (TYP
    ) poses the following problems in practice.
  • (i) Are the residues predicted to be strand in
    segments 1, 5, and 6 errors, or should the
    helices be elongated?
  • (ii) Should the 2nd and 3rd strand be joined, or
    should one of them be ignored, or does the
    prediction indicate two strands, here? Note the
    three-state per-residue accuracy is 60 for the
    prediction given.

104
PSSP Third Generation
  • PHD First algorithm in this generation (1994)
  • Evolutionary information improves the prediction
    accuracy to 72
  • Use of evolutionary information
  • 1. Scan a database with known sequences with
    alignment methods for finding similar sequences
  • 2. Filter the previous list with a threshold to
    identify the most significant sequences
  • 3. Build amino acid exchange profiles based on
    the probable homologs (most significant
    sequences)
  • 4. The profiles are used in the prediction,
    i.e. in building the classifier

105
PSSP Third Generation
  • Many of the second generation algorithms have
    been updated to the third generation

106
PSSP Third Generation
  • Due to the improvement of protein information in
    databases i.e. better evolutionary information,
    todays predictive accuracy is 80
  • It is believed that maximum reachable accuracy is
    88. Why such conjecture?

107
Why 88
  • SS assignments may vary for two versions of the
    same structure
  • Dynamic objects with some regions being more
    mobile than others
  • Assignment differ by 5-15 between different
    X-ray (NMR) versions of the same protein
  • Assignment diff. by about12 between structural
    homologues
  • B. Rost, C. Sander, and R. Schneider, Redefining
    the goals of protein secondary structure
    predictions, J. Mol. Bio.

108
PSSP Data Preparation
  • Public Protein Data Sets used in PSSP research
    contain protein secondary structure sequences. In
    order to use classification algorithms we must
    transform secondary structure sequences into
    classification data tables.
  • Records in the classification data tables are
    called, in PSSP literature (learning) instances.
  • The mechanism used in this transformation process
    is called window.
  • A window algorithm has a secondary structure as
    input and returns a classification table set of
    instances for the classification algorithm.

109
Window
  • Consider a secondary structure (x, e).
  • where (x,e) (x1x2 xn, e1e2en)
  • Window of the length w chooses a subsequence of
    length w of x1x2 xn, and an element ei from
    e1e2en, corresponding to a special position in
    the window, usually the middle
  • Window moves along the sequences
  • x x1x2 xn and e e1e2en
  • simultaneously, starting at the beginning moving
    to the right one letter at the time at each step
    of the process.

110
Window Sequence to Structure
  • Such window is called sequence to structure
    window. We will call it for short a window.
  • The process terminates when the window or its
    middle position reaches the end of the sequence
    x.
  • The pair (subsequence, element of e ) is often
    written in a form
  • subsequence ? H, E or L
  • is called an instance, or a rule.

111
Example Window
  • Consider a secondary structure (x, e) and the
    window of length 5 with the special position in
    the middle (bold letters)
  • Fist position of the window is
  • x A R N S T V V S T A A .
  • e H H H H L L L E E E
  • Window returns instance
  • A R N S T ? H


112
Example Window
  • Second position of the window is
  • x A R N S T V V S T A A .
  • e H H H H L L L E E E
  • Windows returns instance
  • R N S T V ? H
  • Next instances are
  • N S T V V ? L
  • S T V V S ? L
  • T V V S T ? L


113
Symbolic Notation
  • Let f be a protein secondary structure (PSS)
    identification function
  • f A? S i.e. f ? A x S
  • Let x x1x2xn, e e1e2en, f(x) e, we define
  • f(x1x2xn)xi ei, i.e. f(x)xi ei

114
ExampleSemantics of Instances
  • Let
  • x A R N S T V V S T A A .
  • e H H H H L L L E E E
  • And assume that the windows returns an instance
  • A R N S T ? H
  • Semantics of the instance is
  • f(x)NH,
  • where f is the identification function and N is
    preceded by A R and followed by S T and the
    window has the length 5

115
Classification Data Base (Table)
  • We build the classification table with attributes
    being the positions p1, p2, p3, p4, p5 .. pw
  • in the window, where w is length of the
    window.
  • The corresponding values of attributes are
    elements of of the subsequent on the given
    position.
  • Classification attribute is S with values in the
    set H, E, L assigned by the window operation
    (instance, rule).
  • The classification table for our example (first
    few records) is the following.

116
Classification Table (Example)
  • x A R N S T V V S T A A .
  • e H H H H L L L E E E

p1 p2 p3 p4 p5 S
A R N S T H
R N S T V H
N S T V V L
S T V V S L
Semantics of record r r(p1, p2, p3,p4,p5, S) is
f(x)Vp3 Vs where Va denotes a value of
the attribute a.
117
Size of classification datasets (tables)
  • The window mechanism produces very large datasets
  • For example window of size 13 applied to the
    CB513 dataset of 513 protein subunits produces
    about
  • 70,000 records (instances)

118
Window
  • Window has the following parameters
  • PARAMETER 1 i ? N, the starting point of the
    window as it moves along the sequence x x1 x2
    . xn. The value i1 means that window starts
    at x1, i5 means that window starts at x5
  • PARAMETER 2 w ? N denotes the size (length)
    of the window.
  • For example the PHD system of Rost and Sander
    (1994) uses two window sizes 13 and 17.

119
Window
  • PARAMETER 3 p ? 1,2, , w
  • where p is a special position of the window
    that returns the classification attribute values
    from S H, E, L and w is the size (length) of
    the window
  • PSSP PROBLEM
  • find optimal size w, optimal special position
    p for the best prediction accuracy

120
Window Symbolic Definition
  • Window Arguments window parameters and secondary
    structure (x,e)
  • Window Value (subsequence of x, element of e)
  • OPERATION (sequence to structure window)
  • W is a partial function
  • W N ? N ? 1,, k ?(A ? S ) ? A ? S
  • W(i, k, p, (x,e)) (xi x(i1). x(ik-1),
    f(x)x(ip)) where (x,e) (x1x2 ..xn, e1e2en)

121
Neural network models
  • machine learning approach
  • provide training sets of structures (e.g.
    a-helices, non a -helices)
  • are trained to recognize patterns in known
    secondary structures
  • provide test set (proteins with known structures)
  • accuracy 70 75

122
Reasons for improved accuracy
  • Align sequence with other related proteins of the
    same protein family
  • Find members that has a known structure
  • If significant matches between structure and
    sequence assign secondary structures to
    corresponding residues

123
3 State Neural Network
124
Neural Network
125
Input Layer
  • Most of approach set w 17. Why?
  • Based on evidence of statistical correlation with
    secondary structure as far as 8 residues on
    either side of the prediction point
  • The input layer consists of
  • 17 blocks, each represent a position of window
  • Each block has 21 units
  • The first 20 units represent the 20 aa
  • One to provide a null input used when the moving
    window overlaps the amino- or carboxyl-terminal
    end of the protein

126
Binary Encoding Scheme
  • Example
  • Let w 5, and let say we have the sequence
  • A E G K Q.
  • Then the input layer is
  • A,C,D,E,F,G,,N,P,Q,R,S.T,V,W,Y
  • 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 . 0 0
  • 0 0 1 0 ..
  • 0 0 1 0 ..

127
Hidden Layer
  • Represent the structure of the central aa
  • Encoding scheme
  • Can use two units to present
  • (1,0) H, (0,1) E, (0,0) L
  • Some uses three units
  • (1,0,0) H, (0,1,0) E, (0,0,1) L
  • For each connection, we can assign some weight
    value.
  • This weight value can be adjusted to best fit the
    data (training)

128
Output Level
  • Based on the hidden level and some function f,
    calculate the output.
  • Helix is assigned to any group of 4 or more
    contiguous residues
  • Having helix output values greater than sheet
    outputs and greater than some threshold t
  • Strand (E) is assigned to any group of two or
    more contiguous resides, having sheet output
    values greater than helix outputs and greater
    than t
  • Otherwise, assigned to L
  • Note that t can be adjusted as well (training)

129
How PHD works
  • Step 1. BLAST search with input sequence
  • Step 2. Perform multiple seq. alignment and
    calculate aa frequencies for each position

130
How PHD works
  • Step 3. First Level Sequence to structure net
  • Input alignment profile, Output units for H,
    E, L
  • Calculate occurrences of any of the residues
    to be present in either an a-helix, b-strand, or
    loop.

1 2 3 4 5 6 7
H 0.05 E 0.18 L 0.67
N0.2, S0.4, A0.4
131
How PHD works
  • Step 3. Second Level Structure to structure
    net
  • Input First Level values, Output units for H,
    E, L
  • Window size 17

H 0.59 E 0.09 L 0.31
E0.18
Step 4. Decision level
132
Prepare Data for PHD Neural Nets
  • Starting from a sequence of unknown structure
    (SEQUENCE ) the following steps are required to
    finally feed evolutionary information into the
    PHD neural networks
  • a data base search for homologues (method Blast),
  • a refined profile-based dynamic-programming
    alignment of the most likely homologues (method
    MaxHom)
  • a decision for which proteins will be considered
    as homologues (length-depend cut-off for pairwise
    sequence identity)
  • a final refinement, and extraction of the
    resulting multiple alignment. Numbers 1-3
    indicate the points where users of the
    PredictProtein service can interfere to improve
    prediction accuracy without changes made to the
    final prediction method PHD .
  • http//cubic.bioc.columbia.edu/papers/2000_rev_hum
    ana/paper.html

133
PHD Neural Network
134
Prediction Accuracy
135
Where can I learn more?
  • Protein Structure Prediction Center
  • Biology and Biotechnology Research
    ProgramLawrence Livermore National Laboratory,
    Livermore, CA
  • http//predictioncenter.llnl.gov/Center.html

DSSP Database of Secondary Structure
Prediction http//www.sander.ebi.ac.uk/dssp/
136
Computational Molecular Biology
  • Protein Structure Tertiary Prediction via
    Threading

137
Objective
  • Study the problem of predicting the tertiary
    structure of a given protein sequence

138
A Few Examples
actual
predicted
predicted
actual
actual
actual
predicted
predicted
139
Two Comparative Modeling
  • Homology modeling identification of homologous
    proteins through sequence alignment structure
    prediction through placing residues into
    corresponding positions of homologous structure
    models
  • Protein threading make structure prediction
    through identification of good
    sequence-structure fit
  • We will focus on the Protein Threading.

140
Why it Works?
  • Observations
  • Many protein structures in the PDB are very
    similar
  • Eg many 4-helical bundles, globins in the set
    of solved structure
  • Conjecture
  • There are only a limited number of unique
    protein folds in nature

141
Threading Method
  • General Idea
  • Try to determine the structure of a new sequence
    by finding its best fit to some fold in library
    of structures
  • Sequence-Structure Alignment Problem
  • Given a solved structure T for a sequence t1t2tn
    and a new sequence S s1s2 sm, we need to find
    the best match between S and T

142
What to Consider
  • How to evaluate (score) a given alignment of s
    with a structure T?
  • How to efficiently search over all possible
    alignments?

143
Three Main Approaches
  • Protein Sequence Alignment
  • 3D Profile Method
  • Contact Potentials

144
Protein Sequence Alignment Method
  • Align two sequences S and T
  • If in the alignment, si aligns with tj, assign si
    to the position pj in the structure
  • Advantages
  • Simple
  • Disadvantages
  • Similar structures have lots of sequence
    variability, thus sequence alignment may not be
    very helpful

145
3D Profile Method
  • Actually uses structural information
  • Main idea
  • Reduce the 3D structure to a 1D string describing
    the environment of each position in the protein.
    (called the 3D profile (of the fold))
  • To determine if a new sequence S belongs to a
    given fold T, we align the sequence with the
    folds 3D profile
  • First question How to create the 3D profile?

146
Create the 3D Profile
  • For a given fold, do
  • For each residue, determine
  • How buried is it?
  • Fraction of surrounding environment that is polar
  • What secondary structure is it in (alpha-helix,
    beta-sheet, or neither)

147
Create the 3D profile
  • 2. Assign an environment class to each position
  • Six classes describe the burial and polarity
    criteria (exposed, partially buried, very buried,
    different fractions of polar environment)

148
Create the 3D Profile
  • These environment classes depend on the number of
    surrounding polar residues and how buried the
    position is.
  • There are 3 SS for each of these, thus have 18
    environment classes

149
Create the 3D Profile
  • 3. Convert the known structure T to a string of
    environment descriptors
  • 4. Align the new sequence S with E using dynamic
    programming

150
Scores for Alignment
  • Need scores for aligning individual residues with
    environments.
  • Key Different aa prefer diff. environment. Thus
    determine scores by looking at the statistical
    data

151
Scores for Alignment
  1. Choose a database of known structures
  2. Tabulate the number of times we see a particular
    residue in a particular environment class -gt
    compute the score for each env class and each aa
    pair
  3. Choose gap penalties, eg. may charge more for
    gaps in alpha and beta environments

152
Alignment
  • This gives us a table of scores for aligning an
    aa sequence with an environment string
  • Using this scoring and Dynamic Programming, we
    can find an optimal alignment and score for each
    fold in our library
  • The fold with the highest score is the best fold
    for the new sequence

153
Contact Potentials Method
  • Take 3D structure into account more carefully
  • Include information about how residues interact
    with each other
  • Consider pairwise interactions between the
    position pi, pj in the fold
  • For a given alignment, produce a score which is
    the sum over these interactions

154
Problem
  • Have a sequence from the database T t1tn with
    known positions p1pn, and a new sequence S
    s1sm.
  • Find 1 lt r1 lt r2 lt lt rn lt m which maximize
  • where ri is the index of the aa in S which
    occupies position pi
  • This problem is NP-complete for pairwise
    interactions

155
How to Define that Score?
  • Use so-called knowledge-based potentials, which
    comes from databases of observed interactions.
  • The general form

156
How to Define the Score
  • General Idea
  • Define cutoff parameter for contact (e.g. up to
    6 Angstroms)
  • Use the PDB to count up the number of times aa i
    and j are in contact
  • Several method for normalization. Eg.
    Normalization is by hypothetical random
    frequencies

157
Other Variations
  • Many other variations in defining the potentials
  • In addition to pairwise potentials, consider
    single residue potentials
  • Distance-dependent intervals
  • Counting up pairwise contacts separately for
    intervals within 1 Angstrom, between 1 and 2
    Angstroms

158
Threading via Tree-Decomposition
159
Contact Graph
  1. Each residue as a vertex
  2. One edge between two residues if their spatial
    distance is within given cutoff.
  3. Cores are the most conserved segments in the
    template

template
160
Simplified Contact Graph
161
Alignment Example
162
Alignment Example
163
Calculation of Alignment Score

164
Graph Labeling Problem
  • Each core as a vertex
  • Two cores interact if there is an interaction
    between any two residues, each in one core
  • Add one edge between two cores that interact.

h
f
b
d
s
m
c
a
e
i
j
k
l
Each possible sequence alignment position for a
single core can be treated as a possible label
assignment to a vertex in G Di be a set of
all possible label assignments to vertex i. Then
for each label assignment A(i) in Di, we have
165
Tree Decomposition
166
Tree DecompositionRobertson Seymour, 1986
Greedy minimum degree heuristic
h
  1. Choose the vertex with minimum degree
  2. The chosen vertex and its neighbors form a
    component
  3. Add one edge to any two neighbors of the chosen
    vertex
  4. Remove the chosen vertex
  5. Repeat the above steps until the graph is empty

167
Tree Decomposition (Contd)
Tree Decomposition
168
Tree Decomposition-Based Algorithms
  • Bottom-to-Top Calculate the minimal F function
  • 2. Top-to-Bottom Extract the optimal assignment

A tree decomposition rooted at Xr
The score of component Xi
The scores of subtree rooted at Xl
The score of subtree rooted at Xi
The scores of subtree rooted at Xj
Write a Comment
User Comments (0)
About PowerShow.com