Title: Discriminative Graphical Models for Structured Data Prediction
1Discriminative Graphical Models for Structured
Data Prediction
- Yan Liu
- Language Technologies Institute
- School of Computer Science
- Carnegie Mellon University
- Mar 30, 2006
2Structured Data Prediction
- Data in many applications have inherent structures
- Examples (input and its structure)
- Text sequence: parse tree, e.g. "John ate the cat."
- Protein sequence: protein structure, e.g. SEQUENCEXSWGIKQLQARILX and the gp41 core protein of HIV
- 3-D image: segmented objects
- Fundamental importance in many areas
- Potential for significant theoretical and
practical advances
3Conditional Random Fields (Lafferty et al., 2001)
- Global normalization over undirected graphical models
- Model p(label y | observation x), not the joint distribution
- Allow arbitrary dependencies in the observation (e.g. long-range, overlapping)
- Adaptive to different loss functions and regularizers
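A minimal sketch of what global normalization means, assuming a toy linear-chain model: the score of a labeling is a sum of weighted node and edge features, and a single normalizer Z(x) sums over every labeling of the whole sequence. The label set, the weights and the function names are illustrative assumptions, not taken from the slides. Practical implementations compute Z(x) by dynamic programming rather than by enumeration.

```python
import itertools
import math

LABELS = ("H", "E", "C")  # assumed label set for the toy example


def score(x, y, w_emit, w_trans):
    """Unnormalized log-score: weighted node features plus weighted edge features."""
    s = sum(w_emit.get((xi, yi), 0.0) for xi, yi in zip(x, y))
    s += sum(w_trans.get((a, b), 0.0) for a, b in zip(y, y[1:]))
    return s


def conditional_prob(x, y, w_emit, w_trans):
    """p(y | x) with one global normalizer Z(x) over all labelings of x."""
    log_z = math.log(sum(
        math.exp(score(x, cand, w_emit, w_trans))
        for cand in itertools.product(LABELS, repeat=len(x))
    ))
    return math.exp(score(x, tuple(y), w_emit, w_trans) - log_z)


# Made-up weights: residue A favors helix, G favors coil, helix tends to continue.
w_emit = {("A", "H"): 1.0, ("G", "C"): 1.0}
w_trans = {("H", "H"): 0.5}
x = "AAG"

all_labelings = list(itertools.product(LABELS, repeat=len(x)))
total = sum(conditional_prob(x, y, w_emit, w_trans) for y in all_labelings)
best = max(all_labelings, key=lambda y: conditional_prob(x, y, w_emit, w_trans))
```

Because the normalizer is shared by all labelings of x, the probabilities in `total` sum to one, and `best` is the labeling with the highest unnormalized score.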
5Conditional Random Fields (II)
- Promising results in
- Tagging (Collins, 2002) and parsing (Sha and Pereira, 2003)
- Information extraction (Pinto et al., 2003)
- Image processing (Kumar and Hebert, 2004)
- Recent developments
- Alternative estimation algorithms (Collins, 2002; Dietterich et al., 2004)
- Alternative loss functions and use of kernels (Taskar et al., 2003; Altun et al., 2003; Tsochantaridis et al., 2004)
- Bayesian formulation (Qi and Minka, 2005) and semi-Markov version (Sarawagi and Cohen, 2004)
6Complex Structures with Long-range Dependencies
- Long-range dependencies
- The data points are dependent although they are distant in the observed order
- A property beyond the Markov assumptions
- A computational challenge to the current Markov models
- Examples
- Protein structure prediction (e.g. ACEDLPGAEDWCEGGIKQLQARIL)
- Co-reference resolution
- Stock market change
7Outline
- Brief introduction to protein structures
- Conditional graphical models
- Segmentation CRF
- Chain graph model
- Dynamic segmentation CRF
- Conclusion and discussion
8Biology on One Slide
Nobelprize.org
Predict protein structures from sequences
10Protein Structure Hierarchy
- Focus on predicting the topology of the
structures from sequences
APAFSVSPASGACGPECA
11Previous Work
- General approaches
- Sequence similarity searches, e.g. PSI-BLAST (Altschul et al., 1997)
- Profile HMMs, e.g. HMMER (Durbin et al., 1998) and SAM (Karplus et al., 1998)
- Homology modeling or threading, e.g. Threader (Jones, 1998)
- Window-based methods, e.g. PSIPRED (Jones, 2001)
- Carefully designed methods for specific structures
- Examples: αα- and ββ-hairpins, β-turn and β-helix (Efimov, 1991; Wilmot and Thornton, 1990; Bradley et al., 2001)
- Challenges and limitations
- Structural similarity with little sequence similarity (under 25%)
- Long-range interactions
- Hard to generalize
12Conditional Graphical Models
- Protein structural graph
- Undirected graphs
- Nodes: the states of structural units or segments (of fixed or variable lengths)
- Edges: interactions between units (local or long-range dependencies)
- Segmentation definition: Y = {M, {W_i}}
- M: # of segments
- W_i = {S_i, D_i}; S_i: state of the i-th segment, D_i: starting position of the i-th segment
- Figure: protein structural graph with local interactions and long-range interactions
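One way to hold such a segmentation in code, as a sketch only: each segment stores its state S_i and starting position D_i, and a segment is assumed to end where the next one starts. The class and field names are hypothetical, and the example reuses the secondary-structure labels H, E and C purely for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Segment:
    state: str  # S_i, the state label of the segment
    start: int  # D_i, 0-based starting position in the sequence


@dataclass
class Segmentation:
    segments: List[Segment]  # W_1 .. W_M in sequence order
    length: int              # length of the observed sequence x

    @property
    def m(self) -> int:
        """M, the number of segments."""
        return len(self.segments)

    def spans(self) -> List[Tuple[str, int, int]]:
        """(state, start, end) triples with end exclusive."""
        starts = [s.start for s in self.segments] + [self.length]
        return [(seg.state, starts[i], starts[i + 1])
                for i, seg in enumerate(self.segments)]

    def is_valid(self) -> bool:
        """Segments must start at 0, be strictly ordered, and fit in the sequence."""
        starts = [s.start for s in self.segments]
        return (bool(starts) and starts[0] == 0
                and all(a < b for a, b in zip(starts, starts[1:]))
                and starts[-1] < self.length)

    def labels(self) -> str:
        """Expand the segmentation into one single-character label per position."""
        return "".join(state * (end - start) for state, start, end in self.spans())


y = Segmentation(
    [Segment("C", 0), Segment("E", 2), Segment("C", 7), Segment("H", 12), Segment("C", 15)],
    length=18,
)
```

Storing only the starting positions keeps the representation compact; the segment lengths follow from consecutive starts.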
13Conditional Graphical Models (II)
- Similar to CRFs, the model defines the conditional probability of a possible segmentation Y given the observed sequence x (equation not in transcript)
- Structural topology recognition is reduced to a segmentation and labeling problem
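The defining equation did not survive in this transcript. As a hedged sketch, a CRF-style form consistent with the description would be the following, where C_G (the cliques of the protein structural graph), the feature functions f_k, the weights λ_k and the normalizer Z(x) are assumed notation rather than the slide's own:

```latex
P(Y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_{c \in C_G} \sum_{k=1}^{K} \lambda_k \, f_k(x, Y_c) \Big),
\qquad
Z(x) = \sum_{Y'} \exp\Big( \sum_{c \in C_G} \sum_{k=1}^{K} \lambda_k \, f_k(x, Y'_c) \Big)
```

Here Y_c denotes the segments that belong to clique c, so a feature can look at one segment or at a pair of interacting segments together with the whole observation x.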
14Conditional Graphical Models
1. How to define the structural units?
2. How to get the long-range dependencies in the graph?
15Training and Testing Phase
- Training phase: learn the model parameters
- Minimize the regularized log loss
- Iterative search algorithms that seek the direction in which the empirical feature values agree with their expectations under the model
- Testing phase: search for the segmentation that maximizes P(w | x)
3. How to make efficient inferences? (details not covered)
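The training criterion can be written out as a sketch, assuming an exponential-family model with weights λ_k and features f_k summed over the cliques of the graph, and assuming a Gaussian (L2) regularizer with variance σ²; the slides do not state which regularizer was used. The gradient is zero exactly when the empirical feature values match their model expectations, up to the regularization term:

```latex
\begin{aligned}
L(\lambda) &= -\sum_{n=1}^{N} \log P\big(w^{(n)} \mid x^{(n)}\big) + \frac{\lVert \lambda \rVert^2}{2\sigma^2} \\
\frac{\partial L}{\partial \lambda_k} &= -\sum_{n=1}^{N} \Big( f_k\big(x^{(n)}, w^{(n)}\big)
  - \mathbb{E}_{P(w \mid x^{(n)})}\big[ f_k\big(x^{(n)}, w\big) \big] \Big) + \frac{\lambda_k}{\sigma^2}
\end{aligned}
```

Computing the expectation term is where inference enters training, which is why efficient inference matters for both phases.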
16Conditional Graphical Models for Protein
Structure Prediction
18Protein Secondary Structure Prediction
- Given a protein sequence, predict its secondary structure assignments
- Three classes: helix (H), sheet (E) and coil (C)
- Example
APAFSVSPASGACGPECA
CCEEEEECCCCCHHHCCC
19CGM on Secondary Structure Prediction (Liu et al., Bioinformatics 2004; Liu et al., BLC 2003; Lafferty et al., ICML 2004)
- Structural unit
- Individual residue
- Segmentation definition
- Y = (n, {S_i}); n: # of residues, S_i ∈ {H, E, C}
- Models (equations not in transcript)
- Conditional random fields (CRFs)
- Kernel conditional random fields (kCRFs)
20Experiment Results on Prediction Combination
- Comparison methods
- Window-based label combination (Rost and Sander, 1993)
- Window-based score combination (Jones, 1999)
- Maximum entropy Markov model (MEMM) (McCallum et al., 2000)
- Higher-order MEMMs (H-MEMM), pseudo state duration MEMMs (PSMEMM)
- Graphical models are generally better than the window-based approaches
- CRFs perform the best among the four graphical models
21Conditional Graphical Models for Protein
Structure Prediction
22Structural Motif Recognition
- Structural motif
- Identifiable regular arrangement of secondary structural elements
- Super-secondary structure, or protein fold
- Task: given training sequences, decide whether a testing sequence contains the structural motif
- Example: ..APAFSVSPASGACGPECA.. Contains the structural motif? Yes: ..NNEEEEECCCCCHHHCCC..
23CGM for Structural Motif Recognition
- Structural unit
- Secondary structure elements
- Protein structural graph
- Nodes for the states of secondary structural elements of variable lengths
- Edges for interactions between nodes in 3-D
- Example: β-α-β motif
25CGM for Structural Motif Recognition (Liu, Carbonell, Weigele and Gopalakrishnan, RECOMB 2005)
- Segmentation definition
- M: number of segments
- S_i: state of segment i
- D_i: starting position of segment i
- Segmentation conditional random fields (SCRFs)
- For any graph, we have (equation not in transcript)
- For a simplified graph, we have (equation not in transcript)
- Efficient inference, similar to the forward and backward algorithms, can be derived for the simplified graph
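To illustrate the kind of forward recursion referred to here, the sketch below computes the log-normalizer over all segmentations under a deliberately strong assumption: each segment interacts only with the segment immediately before it (a semi-Markov structure). The simplified graph in the original work may retain other edges, so this is not the authors' algorithm. The state set, the maximum segment length and the scoring function are made up for the example; a brute-force enumeration is included as a cross-check.

```python
import math

STATES = ("B", "I")  # hypothetical segment states: strand-like block, insertion
MAX_LEN = 3          # assumed cap on segment length


def logsumexp(vals):
    vals = [v for v in vals if v > -math.inf]
    if not vals:
        return -math.inf
    m = max(vals)
    return m + math.log(sum(math.exp(v - m) for v in vals))


def seg_score(x, prev_state, state, start, end):
    """Toy log-potential of segment x[start:end] in `state` following `prev_state`."""
    s = 0.0
    if state == "B":
        s += sum(1.0 for ch in x[start:end] if ch in "VIL")  # made-up preference
    if prev_state == state:
        s -= 0.5  # discourage two adjacent segments with the same state
    return s


def log_partition(x):
    """Forward recursion over (end position, state of the last segment)."""
    n = len(x)
    # alpha[j][s]: log-sum of scores of all segmentations of x[:j] whose last state is s
    alpha = [{s: -math.inf for s in STATES} for _ in range(n + 1)]
    for j in range(1, n + 1):
        for s in STATES:
            terms = []
            for d in range(1, min(MAX_LEN, j) + 1):
                i = j - d  # start of the last segment
                if i == 0:
                    terms.append(seg_score(x, None, s, 0, j))
                else:
                    terms.extend(alpha[i][p] + seg_score(x, p, s, i, j) for p in STATES)
            alpha[j][s] = logsumexp(terms)
    return logsumexp(list(alpha[n].values()))


def brute_force_log_partition(x):
    """Enumerate every segmentation and labeling; feasible only for tiny inputs."""
    n = len(x)
    totals = []

    def extend(pos, prev_state, acc):
        if pos == n:
            totals.append(acc)
            return
        for d in range(1, min(MAX_LEN, n - pos) + 1):
            for s in STATES:
                extend(pos + d, s, acc + seg_score(x, prev_state, s, pos, pos + d))

    extend(0, None, 0.0)
    return logsumexp(totals)


log_z = log_partition("AVILG")
log_z_check = brute_force_log_partition("AVILG")
```

The recursion costs on the order of n × MAX_LEN × |STATES|² score evaluations, whereas the enumeration grows exponentially with the sequence length.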
26Structural Motif Recognition: Complex Folds with Structural Repeats
- Prevalent in proteins and important in functions
- Each rung consists of structural motifs and insertions of variable lengths
- Challenges
- Low sequence similarity in structural motifs
- Long-range interactions due to insertions
27CGM for Structural Motif Recognition (Liu, Xing and Carbonell, ICML 2005)
- Chain graph
- A graph containing both directed and undirected edges
- Given a variable set V that forms multiple subgraphs U, we have (equation not in transcript)
- Two-layer segmentation W = {M, ?i, T}
- Level 1: envelope, one rung (repeat) with motifs and insertions
- Level 2: motifs/insertions
- M: # of envelopes
- ?i: the segmentation of envelope i
- T_i: the state of envelope i (repeat, non-repeat)
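The chain graph equation is missing from the transcript. For reference, the standard chain graph factorization is sketched below; it is a textbook form and not necessarily the exact expression on the slide. The symbols pa(u) (parents of subgraph u), C_u (cliques associated with u) and ψ_c (clique potentials) are assumed notation.

```latex
P(V) = \prod_{u \in U} P\big(u \mid \mathrm{pa}(u)\big),
\qquad
P\big(u \mid \mathrm{pa}(u)\big) = \frac{1}{Z\big(\mathrm{pa}(u)\big)} \prod_{c \in C_u} \psi_c(V_c)
```

Each factor carries its own normalizer, which depends only on the parents of that subgraph, so normalization is local to each subgraph rather than global over V.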
28CGM for Structural Motif Recognition (Liu, Xing and Carbonell, ICML 2005)
- Chain graph model
- SCRF subunits
- Z_i only needs to be locally normalized
- Reduces the computational complexity dramatically
- Figure labels: SCRFs, motif profile model
29Experiments: Structural Motif Recognition
- Right-handed β-helix fold
- A helix-like structure with
- Three parallel β-strands (B1, B2, B3 strands)
- T2 turn: a conserved two-residue turn
- Low sequence identity (under 25%)
- Associated with bacterial pathogens, e.g. in the infection of plants and as the cause of whooping cough
- Leucine-rich repeats (LLR)
- Solenoid-like regular arrangement of beta-strands and an alpha-helix, connected by coils
- Relatively high sequence identity (many leucines)
- A structural framework for protein-protein interaction
30Experiment Results on SCRF for beta-helix
- Cross-family validation for classifying β-helix proteins
- SCRFs can score all known β-helices higher than non-β-helices
31Experiment Results on SCRF for beta-helix
- Predicted segmentation for known β-helices on cross-family validation
32Verification on Proteins with Recently Crystallized Structures
- Successfully identify proteins from different organisms
- 1YP2: Potato Tuber ADP-Glucose Pyrophosphorylase, score 10.47
- 1PXZ: Jun A 1, the Major Allergen from Cedar Pollen, score 32.35
- GP14 of Shigella bacteriophage: identified as a β-helix protein, score 15.63
33Experiment Results on Chain Graph Model for β-helix
- Cross-family validation for classifying β-helix proteins
- The chain graph model reduces the actual running time of the SCRF model by a factor of about 50
34Experiment Results on Chain Graph Model for LLR
- Cross-family validation for classifying LLR proteins
- Chain graph model can score all known LLR higher than non-LLR
35Experiment Results on Chain Graph Model
- Predicted segmentation for known β-helices and LLRs
- (A) SCRFs (B) chain graph model
36Conditional Graphical Models for Protein
Structure Prediction
37Quaternary Structure Prediction
- Quaternary structures
- Multiple chains associated together through noncovalent bonds or disulfide bonds
- Very limited research work to date
- Complex structures
- Few positive training data
- Example: ..APAFSVSPASGACGPECA.. Contains the quaternary structure? Yes: ..NNEEEEECCCCCHHHCCC..
38Dynamic Segmentation CRF (Liu, Carbonell, Weigele and Gopalakrishnan, submitted 2006)
- Structural building blocks
- Secondary structure elements
- Super-secondary structure elements
- Segmentation definition
- For each sequence j, we define Y_j = (M_j, {W_{j,i}})
- M_j: # of segments in the j-th sequence
- W_{j,i}: a set of labels determining the i-th segment
- Inter-chain and intra-chain interactions
- Figure source: van Raaij et al., Nature (1999)
39Dynamic Segmentation CRF (Liu, Carbonell, Weigele and Gopalakrishnan, submitted 2006)
- Dynamic segmentation CRF
- Exact inference is computationally infeasible
- Training: reversible jump MCMC combined with contrastive divergence
- Testing: reversible jump MCMC with simulated annealing
40Experiment: Quaternary Structure Prediction
- Triple beta-spirals
- Described by van Raaij et al. in Nature (1999)
- Found in DNA viruses and RNA viruses
- Three proteins with crystallized structures and about 20 without structure annotation
- Characterized by unusual stability to heat, protease, and detergent
41Experiment: Quaternary Structure Prediction
- Cross-family validation
- Segmentation results
42Conditional Graphical Models for Protein
Structure Prediction
43Model Roadmap
- Conditional random fields, with kernels: kernel CRFs
- Conditional random fields, with segmentation: segmentation CRFs
- Segmentation CRFs, locally normalized: chain graph model
- Segmentation CRFs, for complex structures involving multiple chains: dynamic segmentation CRFs
44Model Roadmap
- Generalized as conditional graphical models
- Answer the questions of
1. How to define the structural building blocks?
2. How to get the long-range dependencies in the graph?
45Discussion
- Conditional graphical models for structured data prediction with long-range dependencies
- Long-range dependencies are common in many applications
- These dependencies can be effectively handled by CGMs given the guidance of expert knowledge
- How about unsupervised learning, or supervised learning without guidance?
- On-going work
- Bootstrap features using active learning
- Application to text mining from clinical reports
- Efficient inference robust to large-scale applications
46Other Projects
- Ensemble approaches for text classification and genre classification
- Boosting algorithm for noisy and imbalanced data (Liu et al., 2002; Jin et al., 2003)
- Input-dependent ensemble algorithms (Liu et al., 2003; Yan et al., 2004; Liu et al., 2004)
- Information extraction of hidden attributes from product descriptions
- Harmonium graphical model for video data analysis
- Graph-based semi-supervised learning
- Protein-protein interaction prediction
47Acknowledgement
- Carnegie Mellon University
- Jaime Carbonell, Eric Xing, John Lafferty
- Yiming Yang, Chris Langmead, Roni Rosenfeld
- Rong Yan, Luo Si and other students in LTI
- Univ. of Pittsburgh
- Vanathi Gopalakrishnan, Judith Klein-Seetharaman, Ivet Bahar
- Massachusetts Institute of Technology
- Jonathan King, Peter Weigele, Bonnie Berger
- Michigan State Univ.
- Rong Jin
49Features: Structural Motif Recognition
- Node features
- Regular expression template, HMM profiles
- Secondary structure prediction scores
- Segment length
- Inter-node features
- β-strand side-chain alignment scores
- Preferences for parallel alignment scores
- Distance between adjacent B23 segments
- Features are general and easy to extend
50Protein Structural Graph for Beta-helix
51Evaluation Measure
Contingency table (T: true label, P: predicted label)
      P+   P-
T+    p    u
T-    o    n
- Q3 (accuracy)
- Precision, Recall
- Segment Overlap quantity (SOV)
- Matthews correlation coefficient
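The per-residue measures can be sketched as follows, assuming the usual per-class contingency counts: p (correctly predicted positives), u (under-predictions, i.e. missed positives), o (over-predictions) and n (correctly rejected). The function names and the example prediction string are made up for illustration, and SOV is left out because it needs segment-level bookkeeping.

```python
import math


def counts(true, pred, positive):
    """Contingency counts (p, u, o, n) for one class, compared residue by residue."""
    pairs = list(zip(true, pred))
    p = sum(t == positive and q == positive for t, q in pairs)
    u = sum(t == positive and q != positive for t, q in pairs)
    o = sum(t != positive and q == positive for t, q in pairs)
    n = sum(t != positive and q != positive for t, q in pairs)
    return p, u, o, n


def q3(true, pred):
    """Fraction of residues whose three-state label is predicted correctly."""
    return sum(t == q for t, q in zip(true, pred)) / len(true)


def precision(p, u, o, n):
    return p / (p + o) if p + o else 0.0


def recall(p, u, o, n):
    return p / (p + u) if p + u else 0.0


def mcc(p, u, o, n):
    """Matthews correlation coefficient for one class."""
    denom = math.sqrt((p + u) * (p + o) * (n + u) * (n + o))
    return (p * n - u * o) / denom if denom else 0.0


true = "CCEEEEECCCCCHHHCCC"
pred = "CCEEEECCCCCCHHHHCC"  # hypothetical prediction with two residue errors
sheet = counts(true, pred, "E")
```

For the sheet class in this example, one true E residue is missed and none is over-predicted, so precision is 1.0 and recall is 0.8, while Q3 over all three classes is 16/18.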
52Local Information
- PSI-BLAST profile
- Position-specific scoring matrices (PSSM)
- Linear transformation (Kim and Park, 2003)
- SVM classifier with RBF kernel
- Feature 1 (S_i): prediction score for each residue R_i