Title: Mining Patterns in Protein Structures: Algorithms and Applications
1. Mining Patterns in Protein Structures
Algorithms and Applications
MotifSpace
- Wei Wang
- UNC Chapel Hill
- weiwang_at_cs.unc.edu
2. Proteins Are the Machinery of Life
Protein Structure Initiative
Function
Spatial motifs
Protein Data Bank
Serine protease
Papain-like Cysteine protease
GTP binding protein
3. MotifSpace
protein classification
Digital Library
EC
Protein Data Bank
GO
CATH
SCOP
User Input
protein structures
articles
protein family
Motif Filter
Motif Miner
Protein Classifier
Knowledge Retriever
Feature selection, association discovery
spatial motifs
Subgraph mining
Classification
Information retrieval, text mining
family-specific motifs
experimental knowledge
Motif Navigator
Visualization
Spatial Motif Database
Spatial Motif Knowledgebase
Indexing, search
Knowledge management
4. Modeling a Protein by a Set of Points
- Amino acids can be represented by points in 3D space.
ATOM 156 C   GLY A 38 43.696 71.361 61.773 1.00 25.96 C
ATOM 157 O   GLY A 38 43.916 70.461 62.583 1.00 27.40 O
ATOM 158 N   HIS A 39 43.506 72.626 62.145 1.00 25.72 N
ATOM 159 CA  HIS A 39 43.583 73.021 63.550 1.00 22.52 C
ATOM 160 C   HIS A 39 42.367 73.829 63.983 1.00 19.35 C
ATOM 161 O   HIS A 39 41.790 74.562 63.187 1.00 20.24 O
ATOM 162 CB  HIS A 39 44.821 73.890 63.798 1.00 26.08 C
ATOM 163 CG  HIS A 39 46.117 73.173 63.590 1.00 32.47 C
ATOM 164 ND1 HIS A 39 46.786 72.533 64.612 1.00 34.50 N
ATOM 165 CD2 HIS A 39 46.850 72.967 62.471 1.00 31.79 C
ATOM 166 CE1 HIS A 39 47.875 71.961 64.129 1.00 36.40 C
ATOM 167 NE2 HIS A 39 47.937 72.209 62.832 1.00 31.42 N
ATOM 168 N   LEU A 40 41.986 73.701 65.248 1.00 22.27 N
ATOM 169 CA  LEU A 40 40.851 74.468 65.724 1.00 21.68 C
ATOM 170 C   LEU A 40 41.226 75.942 65.709 1.00 23.21 C
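As a minimal sketch of this point-set model, the Python below reduces whitespace-separated ATOM records like the ones above to one 3D point per residue. Using the C-alpha (CA) atom as the representative point and splitting on whitespace are assumptions made for illustration (real PDB files use fixed-width columns), and `residue_points` is a hypothetical helper name.

```python
# Three of the ATOM records listed above, one per line.
RECORDS = """\
ATOM 156 C GLY A 38 43.696 71.361 61.773 1.00 25.96 C
ATOM 159 CA HIS A 39 43.583 73.021 63.550 1.00 22.52 C
ATOM 169 CA LEU A 40 40.851 74.468 65.724 1.00 21.68 C
"""


def residue_points(text, atom_name="CA"):
    """Map (chain, residue number, residue name) -> (x, y, z) for one atom per residue."""
    points = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 9 or fields[0] != "ATOM":
            continue  # skip anything that is not a complete ATOM record
        _, _serial, name, res_name, chain, res_seq, x, y, z = fields[:9]
        if name == atom_name:
            points[(chain, int(res_seq), res_name)] = (float(x), float(y), float(z))
    return points


POINTS = residue_points(RECORDS)
for key, xyz in POINTS.items():
    print(key, xyz)
```

Only residues that have the chosen atom in the input produce a point, so GLY 38 is absent here because its CA record was not included in the excerpt.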
5. Protein structures are chains of amino acid
residues with certain spatial arrangements
[Figure: residues ASP102, HIS57, SER195, ALA55, ASP194, GLY43, GLY42, SER190, GLY40]
Frequent subgraph mining: given a group of
proteins G, each of which is represented by a
graph, and a support threshold 0 ≤ s ≤ 1, find all
maximal subgraphs that occur in at least an s
fraction of the graphs in G.
node: amino acid residue; edge: potential
physical interaction
Graph complexity
Information
Challenge: subgraph isomorphism (NP-complete)
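One simple way to realize the node and edge model above is a distance cutoff: connect two residues when their representative points lie within a fixed distance of each other. The sketch below assumes that rule. The 8.5 angstrom cutoff, the coordinates, and the name `contact_graph` are illustrative assumptions, not values taken from the slides.

```python
from itertools import combinations
from math import dist


def contact_graph(points, cutoff):
    """Return the undirected edges {u, v} whose endpoints are within `cutoff`."""
    edges = set()
    for u, v in combinations(points, 2):
        if dist(points[u], points[v]) <= cutoff:
            edges.add(frozenset((u, v)))
    return edges


# Hypothetical residue coordinates in angstroms (made up for this example).
POINTS = {
    "R1": (0.0, 0.0, 0.0),
    "R2": (6.0, 0.0, 0.0),
    "R3": (0.0, 7.0, 0.0),
    "R4": (20.0, 20.0, 20.0),
}
EDGES = contact_graph(POINTS, cutoff=8.5)
for edge in sorted(sorted(e) for e in EDGES):
    print(edge)
```

The pairwise loop is quadratic in the number of residues, which is acceptable for a single protein; a spatial index would be the natural replacement for larger inputs.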
6. Almost-Delaunay (AD)
- A 4-tuple of points is almost-Delaunay with parameter ε if, by perturbing all points in the set by at most ε, the circumscribing sphere can become empty.
- A 4-tuple of points is AD(ε) if ε is the minimal such perturbation.
[Figure: points R1-R5. Each vertex can move within a sphere of radius ε; a new tetrahedron may be formed due to the perturbation. Blue: Delaunay, which is AD(0). Red: AD(ε).]
(Bandyopadhyay and Snoeyink, SODA, 2004)
7. Graph Representations
CD
E(DT) ⊆ E(AD) ⊆ E(CD)
8. Recurring Patterns from Graph Databases
Input: a database of labeled undirected graphs
[Figure: three labeled graphs (P), (Q), and (S) with nodes p1-p5, q1-q3, and s1-s3; node and edge labels drawn from a, b, c, d, x, y]
Output: all (connected) frequent subgraphs from
the graph database.
[Figure: frequent connected subgraphs over the labels c, d, x, y, each annotated with its support (3/3 or 2/3)]
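To make the support values concrete, here is a small sketch that counts support for the simplest connected patterns, single labeled edges. The three toy graphs are hypothetical (they are not the P, Q, and S graphs of the figure), and a full miner would go on to extend frequent edges into larger subgraphs.

```python
from collections import Counter
from fractions import Fraction


def edge_patterns(labels, edges):
    """Distinct single-edge patterns in one graph: (node label, edge label, node label)."""
    found = set()
    for u, v, edge_label in edges:
        a, b = sorted((labels[u], labels[v]))
        found.add((a, edge_label, b))
    return found


def frequent_edges(database, min_support):
    """Patterns that occur in at least a `min_support` fraction of the graphs."""
    counts = Counter()
    for labels, edges in database:
        counts.update(edge_patterns(labels, edges))  # a graph counts once per pattern
    n = len(database)
    return {p: Fraction(c, n) for p, c in counts.items() if Fraction(c, n) >= min_support}


# Each graph: (node id -> node label, list of (node, node, edge label)).
DATABASE = [
    ({1: "c", 2: "c", 3: "d"}, [(1, 2, "y"), (1, 3, "x")]),
    ({1: "c", 2: "c", 3: "d"}, [(1, 2, "y"), (2, 3, "x")]),
    ({1: "c", 2: "d", 3: "a"}, [(1, 2, "x"), (1, 3, "x")]),
]
RESULT = frequent_edges(DATABASE, Fraction(2, 3))
for pattern, support in RESULT.items():
    print(pattern, support)
```

Support is counted per graph rather than per occurrence, which matches the "fraction of graphs" reading of the supports shown on the slide.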
9. Canonical Adjacency Matrix
- The Canonical Adjacency Matrix (CAM) of a graph G
is the maximal adjacency matrix for G under a
total ordering defined on adjacency matrices.
P3 P2 P5 P4 P1
P1 P2 P3 P4 P5
P1 P2 P3 P5 P4
dxcxyc0x0b00x0a > dxcxyc00xa0x00b > cycx0a0x0bxx00d
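The definition can be checked by brute force on a graph this small. In the sketch below, a code string is read as the lower triangle of the adjacency matrix, row by row, with node labels on the diagonal and "0" for a missing edge, and matrices are ordered by plain string comparison. Both readings are assumptions that happen to reproduce the three codes above. The graph itself is decoded from the largest code, and the node names p1 to p5 are assigned so that the three listed orderings yield the three codes; that assignment is inferred, not stated on the slide.

```python
from itertools import permutations


def matrix_code(labels, edges, order):
    """Concatenate the lower triangle (diagonal included) row by row."""
    parts = []
    for i, u in enumerate(order):
        for v in order[:i]:
            parts.append(edges.get(frozenset((u, v)), "0"))
        parts.append(labels[u])
    return "".join(parts)


def canonical_code(labels, edges):
    """Maximal code over every node ordering (factorial time, tiny graphs only)."""
    return max(matrix_code(labels, edges, p) for p in permutations(labels))


# Graph decoded from "dxcxyc0x0b00x0a" (assumed reading, see the text above).
LABELS = {"p1": "d", "p2": "c", "p3": "c", "p4": "b", "p5": "a"}
EDGES = {
    frozenset(("p1", "p2")): "x",
    frozenset(("p1", "p3")): "x",
    frozenset(("p2", "p3")): "y",
    frozenset(("p2", "p4")): "x",
    frozenset(("p3", "p5")): "x",
}
print(canonical_code(LABELS, EDGES))
```

Enumerating all orderings is only meant to illustrate the definition; a practical miner would build canonical codes incrementally instead of trying every permutation.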
10. CAM Tree: Frequent Subgraphs
Support ≥ 2/3
11. Fast Frequent Subgraph Mining
- Spatial locality
- Subgraphs with bounded degree and size
- Apriori property
- Any supergraph of an infrequent subgraph is infrequent
- Eliminates unnecessary isomorphism checks
- Canonical form
- Avoids redundant examination
- Depth-first
- Incremental isomorphism check
- Better memory utilization
- The state-of-the-art algorithm that can handle large and complex protein graphs
- Open issues
- Substitution
- Dynamics and geometric constraints
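The Apriori property listed above can be illustrated with a deliberately simplified stand-in: patterns here are sets of edge types rather than connected subgraphs, so containment is a subset test and no isomorphism check is needed. That substitution is an assumption made to keep the sketch short; the edge-type strings and the toy database are hypothetical.

```python
from itertools import combinations


def support(pattern, database):
    """Fraction of graphs (given as sets of edge types) containing all of `pattern`."""
    return sum(pattern <= graph for graph in database) / len(database)


def apriori(database, min_support):
    """Level-wise search; a candidate is counted only if all its sub-patterns are frequent."""
    items = sorted(set().union(*database))
    level = [frozenset([i]) for i in items if support(frozenset([i]), database) >= min_support]
    frequent, checked = list(level), len(items)
    while level:
        size = len(level[0]) + 1
        known = set(level)
        candidates = {a | b for a, b in combinations(level, 2) if len(a | b) == size}
        # Apriori pruning: drop candidates that have an infrequent sub-pattern.
        candidates = [c for c in candidates
                      if all(frozenset(s) in known for s in combinations(c, size - 1))]
        checked += len(candidates)
        level = [c for c in candidates if support(c, database) >= min_support]
        frequent.extend(level)
    return frequent, checked


DATABASE = [
    {"cxd", "cyc", "bxc"},
    {"cxd", "cyc"},
    {"cxd", "axc"},
]
FREQUENT, CHECKED = apriori(DATABASE, 2 / 3)
print(sorted(sorted(p) for p in FREQUENT), CHECKED)
```

With four edge types there are 15 non-empty patterns, but only 5 have their support counted here because every superset of an infrequent pattern is skipped.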
12. Proof of Concept: Serine Proteases
Packing motifs identified in the Eukaryotic
Serine Protease. N: total number of structures
included in the data set. s: the support
threshold used to obtain recurring spatial
motifs. T: processing time (in seconds).
M: motif number. C: the sequence of one-letter
residue codes for the residue composition of the
motif. The table also reports the actual number
of occurrences of each motif in the family, the
background frequency of the motif, and
S = -log(P), where the P-value is defined by a
hypergeometric distribution. The packing motifs
were sorted first by their support values in
descending order, and then by their background
frequencies in ascending order. The -log(P)
values are highlighted.
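A hypergeometric P-value of this kind can be sketched as follows, assuming it is the upper tail: the probability that a family of the given size, drawn at random from the background, contains the motif at least as often as observed. The tail direction, the base-10 logarithm, the parameter names, and the numbers in the example are all assumptions for illustration; the slides do not specify them.

```python
from fractions import Fraction
from math import comb, log10


def hypergeom_tail(background_size, background_hits, family_size, family_hits):
    """Exact P(X >= family_hits) for a random family drawn from the background."""
    upper = min(family_size, background_hits)
    favorable = sum(
        comb(background_hits, i) * comb(background_size - background_hits, family_size - i)
        for i in range(family_hits, upper + 1)
    )
    return Fraction(favorable, comb(background_size, family_size))


def score(p):
    """S = -log10(P), taken from the exact fraction to avoid float underflow."""
    return log10(p.denominator) - log10(p.numerator)


# Hypothetical counts: 10 structures overall, 4 contain the motif,
# and all 3 members of the family contain it.
P = hypergeom_tail(background_size=10, background_hits=4, family_size=3, family_hits=3)
print(P, round(score(P), 3))
```

Keeping P as an exact fraction matters for scores as large as those reported for these motifs, where a floating-point P could underflow before the logarithm is taken.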
13. Proof of Concept: Serine Proteases
38 highly specific motifs mined from serine
proteases classified by SCOP v1.65 (Dec 2003)
1HJ9
1MD8
1OP0
1OS8
1PQ7
1P57
1SSX
1S83
14. Proof of Concept: Papain-like Cysteine Protease
All the patterns have -log(P) > 49. The table
lists each pattern's support in the PCP family
and its number of occurrences outside the
family. Patterns that contain the active dyad
(His and Cys) of the proteins are highlighted.
15. Proof of Concept: Papain-like Cysteine Protease
The active site in 1cqd
Choi, K. H., Laursen, R. A., and Allen, K. N.
(1999). The 2.1 angstrom structure of a cysteine
protease with proline specificity from ginger
rhizome, Zingiber officinale. Biochemistry,
38(36), 11624-33.
16. Proof of Concept: Function Inference of Orphan Structure
1nfg (SCOP 51556): Metallo-dependent hydrolase (MDH); 8-stranded βα (TIM) barrel fold; 17 members, 49 family-specific spatial motifs
1m65 (CASP5 T0147): unknown function; no good sequence or global structure alignment to known proteins; 7-stranded barrel fold; 30 motifs found
17. Proof of Concept: Function Inference II
1ecs (SCOP 54598): antibiotic resistance protein; Glyoxalase / bleomycin resistance / dioxygenase superfamily; 4 members (SCOP 1.65), 62 family-specific spatial motifs
1twu (YycE): unknown function; not in SCOP 1.67; DALI z < 10 in Nov 2004; 46 motifs found; structurally similar to the three new non-redundant AR proteins added in SCOP 1.67
18. References and Acknowledgement
- Collaborators
- Catherine Blake (information retrieval)
- Charlie Carter (biochemistry)
- Nikolay Dokholyan (biophysics)
- Leonard McMillan (computer graphics)
- Jan Prins (high performance computing)
- Jack Snoeyink (computational geometry)
- Alexander Tropsha (pharmacy)
- Partially supported by
- Microsoft eScience Applications Award
- Microsoft New Faculty Fellowship
- NSF CAREER Award IIS-0448392
- NSF CCF-0523875
- NSF DMS-0406381
- Prototype deployed at
- Comparing graph representations of protein structure for mining family-specific residue-based packing motifs, Journal of Computational Biology (JCB), 2005.
- SPIN: Mining maximal frequent subgraphs from graph databases, Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD), pp. 581-586, 2004.
- Mining spatial motifs from protein structure graphs, Proceedings of the 8th Annual International Conference on Research in Computational Molecular Biology (RECOMB), pp. 308-315, 2004.
- Accurate classification of protein structural families using coherent subgraph analysis, Proceedings of the Pacific Symposium on Biocomputing (PSB), pp. 411-422, 2004.
- Efficient mining of frequent subgraphs in the presence of isomorphism, Proceedings of the 3rd IEEE International Conference on Data Mining (ICDM), pp. 549-552, 2003.
- Another 45 papers on general methodology development directly related to this project