Title: ESI 2004 OPTIMIZATION AND DATA MINING
1 ESI 2004 OPTIMIZATION AND DATA MINING
- CLASSIFICATION OF FOLDING TYPES IN PROTEINS USING MILP
- FADIME ÜNEY
- (funey_at_ku.edu.tr)
- Supervised by METIN TÜRKAY
- (mturkay_at_ku.edu.tr)
- KOÇ UNIVERSITY, ISTANBUL
2 AGENDA
- Proteins
- Structures of proteins
- Classification of folding types
- Folding type prediction
- Propositional Logic
- Model (MILP Formulation)
- Parameters
- Variables
- Objective function
- Constraints
- Illustrative Examples
- Training Set Results
- Future Research
3 PROTEINS
- Bones, muscles, skin and hair of organisms
- Used in structure of cells
- Required for proper functioning and regulation of organisms, e.g. enzymes, hormones, antibodies
- Amino acids → PROTEINS
- Molecules of life
4 CHEMICAL STRUCTURE OF AMINO ACIDS
- Distinguishing feature
- Different R groups
[Figure: amino acid structure; labels: carboxyl group, side chain, amino group]
5 CLASSIFICATION OF AMINO ACID
6 PEPTIDE BOND
Repeating units: N-Cα-C(=O), N-Cα-C(=O)
7 PROTEINS
- Since part of the amino acid is lost during dehydration synthesis, the units of a protein are called amino acid residues
- Typical protein: 200-300 residues
- may increase up to 27,000
- Residue content and order is unique for each protein
- Sequence and types of side chains determine
- 3D shape, chemical and biological functions
8 STRUCTURES OF PROTEINS
9 PRIMARY STRUCTURE
- Sequence of amino acids
- A C M V I I C E V
- No arrangement of peptide bonds
- No angles between chemical bonds
- No interactions between any parts of residues
- Amino acid content and order dictates
- Shape of protein molecule
- Its spatial and biochemical properties
10 SECONDARY STRUCTURE
- Local spatial arrangement of its main chain atoms
- Without regard to conformation of its side chains
- Without relationship with other segments
- Types of secondary structures
- α-helices
- β-sheets
- Loops, turns and coils
11 CLASSIFICATION OF FOLDING TYPES IN PROTEINS
- ALL-ALPHA (α) → α-helices > 40% and β-sheets < 5%
- ALL-BETA (β) → α-helices < 5% and β-sheets > 40%
- ALPHA+BETA (α+β) → α-helices > 15% and β-sheets > 15% (> 60% antiparallel)
- ALPHA/BETA (α/β) → α-helices > 15% and β-sheets > 15% (> 60% parallel)
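The four definitions above can be read as a small decision rule. The sketch below is only an illustration: it assumes the numbers are percentage thresholds with strict inequalities, and the function name `classify_fold` and its arguments are hypothetical, not part of the original slides.

```python
def classify_fold(helix_pct, sheet_pct, antiparallel_pct=None):
    """Map secondary-structure percentages to a folding type.

    helix_pct, sheet_pct: percentage of residues in alpha-helices / beta-sheets.
    antiparallel_pct: percentage of the beta-sheets that are antiparallel
                      (assumed input; only needed for the two mixed classes).
    Returns one of the four class names, or None if no rule applies.
    """
    if helix_pct > 40 and sheet_pct < 5:
        return "ALL-ALPHA"
    if helix_pct < 5 and sheet_pct > 40:
        return "ALL-BETA"
    if helix_pct > 15 and sheet_pct > 15 and antiparallel_pct is not None:
        if antiparallel_pct > 60:
            return "ALPHA+BETA"
        if 100 - antiparallel_pct > 60:  # i.e. more than 60% parallel
            return "ALPHA/BETA"
    return None
```

A protein that satisfies none of the rules (for example 10% helices and 10% sheets) returns None, which reflects that the four classes do not cover every possible structure.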
12 FOLDING TYPE PREDICTION
- Functions of proteins → study of fundamental biological processes
- Genetic engineering
- In case of human → DESIGN OF DRUGS
- Experimental methods → slow, require large amounts of resources
- Focal research subject in computational biology and bioinformatics
13 FOLDING TYPE PREDICTION
- Folding type of a protein depends on amino acid composition (Nakashima et al., 1986)
- Several methods studied
- Chou, 1995 (Component coupled, 95.3%)
- Bahar et al., 1997 (Singular Value Decomposition, 81%)
- Cai & Zhou, 2000 (Neural Network, 89.2%)
- Cai et al., 2001 (Support Vector Machines, 93.2%)
- Properties of training and test set
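Since the prediction methods listed here all start from amino acid composition, a minimal sketch of that feature may help. It assumes composition means the fraction of each of the 20 standard amino acids in a chain; the names `AMINO_ACIDS` and `composition` are illustrative, and the input is the example sequence from the primary-structure slide.

```python
# One-letter codes of the 20 standard amino acids, in alphabetical order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(sequence):
    """Return a 20-element list with the fraction of each amino acid."""
    seq = sequence.replace(" ", "").upper()
    if not seq:
        raise ValueError("empty sequence")
    counts = {aa: 0 for aa in AMINO_ACIDS}
    for residue in seq:
        if residue not in counts:
            raise ValueError(f"unknown residue: {residue!r}")
        counts[residue] += 1
    return [counts[aa] / len(seq) for aa in AMINO_ACIDS]

# Example sequence taken from the primary-structure slide.
vec = composition("A C M V I I C E V")
```

Each chain then becomes one point in a 20-dimensional space, which is the kind of data point a classifier can work with.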
14 FOLDING TYPE PREDICTION
[Figure: panels labeled SVM and MIP]
15 PROPOSITIONAL LOGIC
- Express relationships among Boolean variables
- Boolean variables (True or False)
- Operators
- OR
- AND
- IMPLICATION
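Logical relations among Boolean variables are useful in MILP because each operator has a standard linear-inequality form over 0-1 variables. The sketch below checks those textbook translations by enumerating the truth tables; the function names are hypothetical, and this is not claimed to be the exact set of translations used in the talk.

```python
from itertools import product

def or_linear(y1, y2):
    # y1 OR y2 holds exactly when y1 + y2 >= 1
    return y1 + y2 >= 1

def and_linear(y1, y2):
    # y1 AND y2 holds exactly when y1 >= 1 and y2 >= 1
    return y1 >= 1 and y2 >= 1

def implies_linear(y1, y2):
    # y1 => y2 holds exactly when y1 - y2 <= 0
    return y1 - y2 <= 0

def check_all():
    """Compare each inequality with the Boolean operator on all 0-1 inputs."""
    for y1, y2 in product((0, 1), repeat=2):
        a, b = bool(y1), bool(y2)
        if or_linear(y1, y2) != (a or b):
            return False
        if and_linear(y1, y2) != (a and b):
            return False
        if implies_linear(y1, y2) != ((not a) or b):
            return False
    return True
```

Calling `check_all()` confirms that every inequality agrees with its operator on all four input combinations.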
16 MIXED-INTEGER LINEAR PROGRAMMING (MILP) FORMULATION
- Indices
- i protein
- j chain of the protein (A, B, C,...)
- k folding type of the protein (α, β, α+β, α/β)
- l box that encloses a number of data points belonging to a type (1, 2, .., L)
- m amino acid (1, 2, .., 20)
- n bound (lower, upper)
17 MILP FORMULATION (cont.)
- Parameters
- comp_ijm: composition of amino acid m in the subunit j of protein i
- foldtype_ijk: folding type k of the subunit j of protein i
- compU: a sufficiently large parameter
18 MILP FORMULATION (cont.)
- Variables (binary)
- YB_l: existence of box l
- YBC_lk: assignment of folding type k to box l
- YPB_ijl: assignment of subunit j of protein i to box l
- YPC_ijk: assignment of subunit j of protein i to class k
- YL_lsm: lower bound of box s is between the bounds of box l for amino acid m
- YU_lsm: upper bound of box s is between the bounds of box l for amino acid m
- YC_ls: intersection of box l and box s
19 MILP FORMULATION (cont.)
- Variables (continuous)
- X_lmn: bound n for amino acid m in box l
- XD_lkmn: bound n for amino acid m in box l for class k
- XP1_ijk: models misallocation of subunit j of protein i to class k
- XP2_ijk: models misallocation of subunit j of protein i to class k
20 MILP FORMULATION (cont.)
OBJECTIVE FUNCTION: Minimize
- Intersection
- Misallocation
21 MILP FORMULATION (cont.)
CONSTRAINTS: Bounds for boxes
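As a rough illustration of what bound constraints for a box can look like, here is one common big-M pattern written with the symbols defined earlier (composition comp, box bounds X, assignment YPB, large parameter compU). It is an assumed sketch of the general technique, not the constraint set from the original slide.

```latex
\begin{align}
X_{l,m,\mathrm{lower}} - \mathit{compU}\,(1 - YPB_{ijl}) &\le \mathit{comp}_{ijm}
  && \forall i, j, l, m \\
\mathit{comp}_{ijm} &\le X_{l,m,\mathrm{upper}} + \mathit{compU}\,(1 - YPB_{ijl})
  && \forall i, j, l, m \\
X_{l,m,\mathrm{lower}} &\le X_{l,m,\mathrm{upper}}
  && \forall l, m
\end{align}
```

When YPB_ijl = 1 the first two inequalities force the composition of subunit j of protein i to lie inside box l for every amino acid m; when YPB_ijl = 0 the large parameter compU relaxes them.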
22 MILP FORMULATION (cont.)
Relationship between protein-box and protein-class
Relationship between box-class
23 MILP FORMULATION (cont.)
Relationship between protein-box-class
Misallocation
24 MILP FORMULATION (cont.)
Intersection
[Figure: two arrangements of overlapping boxes l and s]
25 MILP FORMULATION (cont.)
Intersection
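The geometric idea behind the intersection constraints can be sketched outside the MILP: a box is a pair of lower and upper bound vectors (one entry per amino acid), a point belongs to a box if every coordinate lies within the bounds, and two boxes intersect if their intervals overlap in every dimension. The helper names below are hypothetical, and the code is a plain geometric check rather than the model's constraints.

```python
def in_box(point, lower, upper):
    """True if every coordinate of point lies within [lower, upper]."""
    return all(lo <= x <= up for x, lo, up in zip(point, lower, upper))

def boxes_intersect(lower_l, upper_l, lower_s, upper_s):
    """True if boxes l and s overlap in every dimension."""
    return all(
        max(ll, ls) <= min(ul, us)
        for ll, ul, ls, us in zip(lower_l, upper_l, lower_s, upper_s)
    )

# Toy two-dimensional boxes (made-up numbers, for illustration only).
box_l = ([0.0, 0.0], [0.5, 0.5])
box_s = ([0.4, 0.4], [0.9, 0.9])
box_t = ([0.6, 0.0], [0.9, 0.3])
```

Here `box_l` and `box_s` share the region between 0.4 and 0.5 in both dimensions, while `box_l` and `box_t` are separated along the first dimension, so only the first pair intersects.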
26 MODELING ENVIRONMENT
27 ILLUSTRATIVE EXAMPLES
[Figure legend: ALP, BET, APB, ASB]
28 ILLUSTRATIVE EXAMPLES
[Figure legend: ALP, BET, APB, ASB]
29 TRAINING SET
- Better training database
- A good quality of structure
- As many nonhomologous structures as possible
- A typical or distinguishable feature for each class
- PDB (Protein Data Bank)
- http://www.rcsb.org/pdb/
- SCOP (Structural Classification of Proteins)
- www.scop.mrc-lmb.cam.ac.uk/scop/
- 30 from each class
30 PDB code (Brookhaven National Laboratory): 1ABA, 8ATC, etc. The 5th letter indicates the chain of the protein.
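The naming convention just described can be sketched as a small parser: four characters identify the PDB entry and an optional fifth character names the chain. The function name `split_pdb_code` and the five-character input "8ATCA" are hypothetical examples.

```python
def split_pdb_code(code):
    """Split a PDB code into (entry, chain); chain is None if absent."""
    code = code.strip().upper()
    if len(code) == 4:
        return code, None
    if len(code) == 5:
        return code[:4], code[4]
    raise ValueError(f"expected 4 or 5 characters, got {code!r}")

entry, chain = split_pdb_code("8ATCA")  # hypothetical chain suffix "A"
```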
31 RESULTS
- Very big problem → Dual Simplex Method
- Iterative solution procedure → Objective Function = 0
- 14 Boxes
- 4 ALL-ALPHA, 4 ALPHA+BETA
- 3 ALL-BETA, 3 ALPHA/BETA
- Training accuracy: 100%
32 FUTURE RESEARCH
- Test set
- 1600 proteins
- Jack-knife Test
- Prediction accuracy
- Model
- Distance-based classification
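A jack-knife test leaves out one sample at a time, trains on the rest, and predicts the held-out sample. The sketch below pairs that loop with a nearest-centroid rule as a simple stand-in for a distance-based classifier; the function names and the toy data are assumptions for illustration, not the method or data of the talk.

```python
import math

def nearest_centroid_predict(train_x, train_y, query):
    """Predict the label whose class centroid is closest to query."""
    sums, counts = {}, {}
    for x, y in zip(train_x, train_y):
        acc = sums.setdefault(y, [0.0] * len(x))
        for d, value in enumerate(x):
            acc[d] += value
        counts[y] = counts.get(y, 0) + 1
    best_label, best_dist = None, math.inf
    for label, acc in sums.items():
        centroid = [value / counts[label] for value in acc]
        dist = math.dist(centroid, query)  # Euclidean distance
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

def jackknife_accuracy(xs, ys, predict):
    """Leave-one-out accuracy: hold out each sample once."""
    correct = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        if predict(train_x, train_y, xs[i]) == ys[i]:
            correct += 1
    return correct / len(xs)

# Toy, made-up two-dimensional data with two well-separated classes.
toy_x = [[0.0, 0.1], [0.1, 0.0], [0.05, 0.05],
         [0.9, 1.0], [1.0, 0.9], [0.95, 0.95]]
toy_y = ["ALP", "ALP", "ALP", "BET", "BET", "BET"]
accuracy = jackknife_accuracy(toy_x, toy_y, nearest_centroid_predict)
```

Because any prediction function with the same three arguments can be passed in, the same loop could evaluate a box-based classifier once one is trained.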