Mining Patterns in Protein Structures: Algorithms and Applications

1
Mining Patterns in Protein Structures
Algorithms and Applications
MotifSpace
  • Wei Wang
  • UNC Chapel Hill
  • weiwang_at_cs.unc.edu

2
Proteins Are the Machinery of Life
Protein Structure Initiative
Function
Spatial motifs
Protein Data Bank
Serine protease
Papain-like Cysteine protease
GTP binding protein
3
MotifSpace
[Figure: MotifSpace system architecture. Labels:]
  • Data sources: Digital Library, EC, Protein Data Bank, GO, CATH, SCOP
  • User input: protein structures, articles, protein family
  • Components: Motif Filter, Motif Miner, Protein Classifier, Knowledge Retriever, Motif Navigator
  • Techniques: feature selection, association discovery, subgraph mining, classification, information retrieval, text mining, visualization, indexing, search, knowledge management
  • Intermediate data: protein classification, spatial motifs, family-specific motifs, experimental knowledge
  • Stores: Spatial Motif Database, Spatial Motif Knowledgebase
4
Modeling a Protein by a Set of Points
  • Amino acids can be represented by points in 3D
    space.

ATOM    156  C   GLY A  38      43.696  71.361  61.773  1.00 25.96           C
ATOM    157  O   GLY A  38      43.916  70.461  62.583  1.00 27.40           O
ATOM    158  N   HIS A  39      43.506  72.626  62.145  1.00 25.72           N
ATOM    159  CA  HIS A  39      43.583  73.021  63.550  1.00 22.52           C
ATOM    160  C   HIS A  39      42.367  73.829  63.983  1.00 19.35           C
ATOM    161  O   HIS A  39      41.790  74.562  63.187  1.00 20.24           O
ATOM    162  CB  HIS A  39      44.821  73.890  63.798  1.00 26.08           C
ATOM    163  CG  HIS A  39      46.117  73.173  63.590  1.00 32.47           C
ATOM    164  ND1 HIS A  39      46.786  72.533  64.612  1.00 34.50           N
ATOM    165  CD2 HIS A  39      46.850  72.967  62.471  1.00 31.79           C
ATOM    166  CE1 HIS A  39      47.875  71.961  64.129  1.00 36.40           C
ATOM    167  NE2 HIS A  39      47.937  72.209  62.832  1.00 31.42           N
ATOM    168  N   LEU A  40      41.986  73.701  65.248  1.00 22.27           N
ATOM    169  CA  LEU A  40      40.851  74.468  65.724  1.00 21.68           C
ATOM    170  C   LEU A  40      41.226  75.942  65.709  1.00 23.21           C
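A minimal sketch of how ATOM records like these might be reduced to one point per residue. It assumes the C-alpha atom is used as the residue's point (the slide does not say which point is chosen) and that fields are whitespace-separated. The name `ca_points` is hypothetical.

```python
# Sketch: turn PDB ATOM records into one 3D point per residue.
# Assumption: the C-alpha (CA) atom stands in for the residue.
# Assumption: fields are whitespace-separated. Real PDB files are fixed-width,
# so column slicing is the safer choice when fields run together.
from typing import Dict, Tuple

Point = Tuple[float, float, float]

RECORDS = """\
ATOM 158 N HIS A 39 43.506 72.626 62.145 1.00 25.72 N
ATOM 159 CA HIS A 39 43.583 73.021 63.550 1.00 22.52 C
ATOM 160 C HIS A 39 42.367 73.829 63.983 1.00 19.35 C
ATOM 168 N LEU A 40 41.986 73.701 65.248 1.00 22.27 N
ATOM 169 CA LEU A 40 40.851 74.468 65.724 1.00 21.68 C
"""


def ca_points(text: str) -> Dict[str, Point]:
    """Map a residue id such as 'HIS39' to its C-alpha coordinates."""
    points: Dict[str, Point] = {}
    for line in text.splitlines():
        fields = line.split()
        # Keep only ATOM records whose atom name is CA.
        if len(fields) < 9 or fields[0] != "ATOM" or fields[2] != "CA":
            continue
        res_name, res_seq = fields[3], fields[5]
        x, y, z = (float(v) for v in fields[6:9])
        points[f"{res_name}{res_seq}"] = (x, y, z)
    return points


points = ca_points(RECORDS)
print(points)
```

Other choices of representative point (for example a side-chain centroid) would fit the same interface.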
5
Protein structures are chains of amino acid
residues with certain spatial arrangements
[Figure: residue graph with nodes ASP102, HIS57, SER195, ALA55, ASP194, GLY43, GLY42, SER190, GLY40]
  • Frequent subgraph mining: given a group of proteins G, each represented by a graph, and a support threshold s with 1 ≥ s ≥ 0, find all maximal subgraphs that occur in at least a fraction s of the graphs in G.
  • Node: amino acid residue. Edge: potential physical interaction.
  • [Figure labels: Graph complexity, Information]
  • Challenge: subgraph isomorphism (NP-complete)
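The support definition above can be sketched with a brute-force subgraph test. This is only an illustration on hypothetical toy graphs, not the mining algorithm from the talk. The helper names (`make_graph`, `contains`, `support`) and the single edge label "c" are assumptions. The search over all injective node mappings is exponential, which is the difficulty the slide points to.

```python
# Sketch: support of a pattern = fraction of database graphs that contain it.
# Graphs are dicts with node labels and labeled, undirected edges.
from itertools import permutations


def make_graph(labels, edges):
    """labels: node -> label; edges: (u, v) -> edge label (undirected)."""
    return {"labels": dict(labels),
            "edges": {frozenset(e): lab for e, lab in edges.items()}}


def contains(graph, pattern):
    """Brute-force subgraph isomorphism: try every injective node mapping."""
    p_nodes = list(pattern["labels"])
    for image in permutations(graph["labels"], len(p_nodes)):
        m = dict(zip(p_nodes, image))
        # Node labels must match.
        if any(pattern["labels"][u] != graph["labels"][m[u]] for u in p_nodes):
            continue
        # Every pattern edge must exist in the graph with the same label.
        if all(graph["edges"].get(frozenset(m[u] for u in e)) == lab
               for e, lab in pattern["edges"].items()):
            return True
    return False


def support(database, pattern):
    return sum(contains(g, pattern) for g in database) / len(database)


# Hypothetical toy database: node labels are residue types, "c" marks a contact.
db = [
    make_graph({1: "H", 2: "D", 3: "S"}, {(1, 2): "c", (1, 3): "c"}),
    make_graph({1: "H", 2: "S", 3: "G"}, {(1, 2): "c", (2, 3): "c"}),
    make_graph({1: "D", 2: "S"}, {(1, 2): "c"}),
]
hs_pattern = make_graph({"u": "H", "v": "S"}, {("u", "v"): "c"})
ds_pattern = make_graph({"u": "D", "v": "S"}, {("u", "v"): "c"})

print(support(db, hs_pattern))  # H-S contact occurs in 2 of the 3 graphs
print(support(db, ds_pattern))  # D-S contact occurs in 1 of the 3 graphs
```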
6
Almost-Delaunay (AD)
  • A 4-tuple of points is almost-Delaunay with
    parameter ε if, by perturbing all points in the
    set by at most ε, the circumscribing sphere can
    become empty.
  • A 4-tuple of points is AD(ε) if ε is the minimal
    such perturbation.

[Figure: points R1 to R5. Each vertex can move within a sphere of radius ε, and a new tetrahedron may be formed due to the perturbation. Blue: Delaunay, i.e. AD(0). Red: AD(ε).]
(Bandyopadhyay and Snoeyink, SODA, 2004)
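The unperturbed case, AD(0), is the ordinary Delaunay condition: the circumscribing sphere of the 4-tuple contains no other point. A minimal sketch of that test follows. It does not compute the minimal perturbation ε, and the function names are hypothetical.

```python
# Sketch: empty-circumsphere test for one tetrahedron (the AD(0) case only).
import math


def det3(m):
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def circumsphere(p0, p1, p2, p3):
    """Center and radius of the sphere through four non-coplanar points."""
    # Equidistance from p0 and p gives: 2 (p - p0) . c = |p|^2 - |p0|^2
    rows, rhs = [], []
    for p in (p1, p2, p3):
        rows.append([2.0 * (p[k] - p0[k]) for k in range(3)])
        rhs.append(sum(p[k] ** 2 - p0[k] ** 2 for k in range(3)))
    d = det3(rows)
    if abs(d) < 1e-12:
        raise ValueError("points are coplanar")
    center = []
    for col in range(3):  # Cramer's rule, one coordinate at a time
        m = [row[:] for row in rows]
        for r in range(3):
            m[r][col] = rhs[r]
        center.append(det3(m) / d)
    center = tuple(center)
    return center, math.dist(center, p0)


def is_delaunay(tetra, others, tol=1e-9):
    """True if no point in `others` lies strictly inside the circumsphere."""
    center, radius = circumsphere(*tetra)
    return all(math.dist(center, q) >= radius - tol for q in others)


tetra = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1)]
print(circumsphere(*tetra))
print(is_delaunay(tetra, [(2, 2, 2)]))        # far point: sphere stays empty
print(is_delaunay(tetra, [(0.5, 0.5, 0.4)]))  # point near the center: not empty
```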
7
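Graph Representations
  • CD: contact distance graph. DT: Delaunay tessellation graph. AD: almost-Delaunay graph.
  • Edge sets are nested: E(DT) ⊆ E(AD) ⊆ E(CD)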
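As an illustration of the largest of the three edge sets, a contact distance graph can be sketched by joining every pair of points that lie within a cutoff. The cutoff value and the point coordinates below are hypothetical, and `contact_edges` is an assumed name. The DT and AD edge sets need a tessellation and are not computed here.

```python
# Sketch: edges of a contact distance (CD) graph from residue points.
import math
from itertools import combinations


def contact_edges(points, cutoff):
    """Undirected edges between all pairs of points at distance <= cutoff."""
    return {frozenset((a, b))
            for a, b in combinations(points, 2)
            if math.dist(points[a], points[b]) <= cutoff}


# Hypothetical residue points (in angstroms) and a hypothetical cutoff.
residues = {"A": (0.0, 0.0, 0.0), "B": (5.0, 0.0, 0.0), "C": (12.0, 0.0, 0.0)}
edges = contact_edges(residues, cutoff=8.5)
print(sorted(sorted(e) for e in edges))
```

Every pair is compared, so the cost grows quadratically with the number of residues; a spatial grid would reduce that for large structures.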
8
Recurring patterns from Graph Databases
  • Input: a database of labeled undirected graphs
[Figure: three graphs (P), (Q), (S) with nodes p1 to p5, q1 to q3, s1 to s3; node labels a, b, c, d; edge labels x, y]
  • Output: all (connected) frequent subgraphs from the graph database
[Figure: frequent subgraphs, each annotated with its support, 3/3 or 2/3]
9
Canonical Adjacency Matrix
  • The Canonical Adjacency Matrix (CAM) of a graph G
    is the maximal adjacency matrix for G under a
    total ordering defined on adjacency matrices.

[Figure: three adjacency matrices of the same graph. Row and column orders shown: P3 P2 P5 P4 P1; P1 P2 P3 P4 P5; P1 P2 P3 P5 P4]
dxcxyc0x0b00x0a > dxcxyc00xa0x00b > cycx0a0x0bxx00d
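A brute-force sketch of the canonical form: write each adjacency matrix as a code (lower triangle row by row, node label on the diagonal, "0" for a missing edge) and take the maximum over all node orders. The graph below is reconstructed from the first code string, assuming its rows follow the order P3 P2 P5 P4 P1. The comparison is assumed to be plain lexicographic order with "0" lowest, which agrees with the ordering of the three codes on the slide.

```python
# Sketch: canonical adjacency matrix code by exhaustive search over node orders.
from itertools import permutations

# Node and edge labels reconstructed from the code "dxcxyc0x0b00x0a" (assumption).
LABELS = {"P1": "a", "P2": "c", "P3": "d", "P4": "b", "P5": "c"}
EDGES = {
    frozenset(("P2", "P3")): "x",
    frozenset(("P3", "P5")): "x",
    frozenset(("P2", "P5")): "y",
    frozenset(("P2", "P4")): "x",
    frozenset(("P1", "P5")): "x",
}


def code(order):
    """Concatenate the lower-triangular rows of the matrix for this node order."""
    chars = []
    for i, u in enumerate(order):
        for v in order[:i]:
            chars.append(EDGES.get(frozenset((u, v)), "0"))
        chars.append(LABELS[u])  # diagonal entry holds the node label
    return "".join(chars)


def canonical_code():
    """Maximum code over all n! node orders (feasible only for tiny graphs)."""
    return max(code(order) for order in permutations(LABELS))


print(code(("P3", "P2", "P5", "P4", "P1")))
print(code(("P3", "P2", "P5", "P1", "P4")))
print(canonical_code())
```

The exhaustive maximum is only for illustration; a practical miner avoids enumerating all orders.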
10
CAM Tree: Frequent Subgraphs
[Figure: CAM tree of frequent subgraphs; support threshold 2/3]
11
Fast Frequent Subgraph Mining
  • Spatial locality
      • Subgraphs with bounded degree and size
  • Apriori property
      • Any supergraph of an infrequent subgraph is
        infrequent
      • Eliminates unnecessary isomorphism checks
  • Canonical form
      • Avoids redundant examination
  • Depth-first
      • Incremental isomorphism check
      • Better memory utilization
  • The state-of-the-art algorithm that can handle
    large and complex protein graphs
  • Open issues
      • Substitution
      • Dynamics and geometric constraints
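The Apriori property can be illustrated at the smallest scale: count which single-edge patterns are frequent, then reject any larger candidate that contains an infrequent edge before running an isomorphism check. This is a simplified sketch on hypothetical toy graphs, not the algorithm from the talk, and all function names are assumptions.

```python
# Sketch: Apriori-style pruning using frequent single-edge patterns.
from collections import Counter


def edge_types(graph):
    """Set of (node label, edge label, node label) triples present in a graph."""
    labels, edges = graph
    types = set()
    for (u, v), e in edges.items():
        a, b = sorted((labels[u], labels[v]))  # undirected: order the end labels
        types.add((a, e, b))
    return types


def frequent_edge_types(database, min_support):
    """Edge types occurring in at least min_support of the graphs."""
    counts = Counter(t for g in database for t in edge_types(g))
    n = len(database)
    return {t: c / n for t, c in counts.items() if c / n >= min_support}


def may_be_frequent(pattern, frequent):
    """A pattern containing any infrequent edge type cannot be frequent."""
    return edge_types(pattern) <= set(frequent)


# Hypothetical toy database: (node labels, labeled edges).
database = [
    ({1: "c", 2: "c", 3: "d"}, {(1, 2): "y", (1, 3): "x", (2, 3): "x"}),
    ({1: "c", 2: "c", 3: "a"}, {(1, 2): "y", (1, 3): "x"}),
    ({1: "c", 2: "d"}, {(1, 2): "x"}),
]
frequent = frequent_edge_types(database, min_support=2 / 3)

ok_pattern = ({1: "c", 2: "c", 3: "d"}, {(1, 2): "y", (2, 3): "x"})
bad_pattern = ({1: "c", 2: "c", 3: "a"}, {(1, 2): "y", (2, 3): "x"})
print(sorted(frequent))
print(may_be_frequent(ok_pattern, frequent))   # all edge types are frequent
print(may_be_frequent(bad_pattern, frequent))  # contains the infrequent a-x-c edge
```

Passing this filter is necessary but not sufficient: a surviving candidate still needs its support counted.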

12
Proof of Concept: Serine Proteases
Packing motifs identified in the eukaryotic serine protease family. Table legend:
  • N: total number of structures included in the data set
  • s: the support threshold used to obtain recurring spatial motifs
  • T: processing time (in seconds)
  • M: motif number
  • C: the sequence of one-letter residue codes for the residue composition of the motif
  • κ: the actual number of occurrences of the motif in the family
  • δ: the background frequency of the motif
  • S: -log(P), where the P-value is defined by a hypergeometric distribution
The packing motifs are sorted first by their support values in descending order, and then by their background frequencies in ascending order. The -log(P) values are highlighted.
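One way a hypergeometric P-value and the score -log(P) might be computed is sketched below. The exact definition used for the table is not given on the slide, so the upper-tail form, the base-10 logarithm, the function names, and the numbers in the example are all assumptions.

```python
# Sketch: hypergeometric enrichment score for a motif in a protein family.
import math


def hypergeom_tail(total, with_motif, family_size, hits):
    """P(X >= hits) when `family_size` structures are drawn without replacement
    from `total` structures, of which `with_motif` contain the motif."""
    denom = math.comb(total, family_size)
    upper = min(with_motif, family_size)
    numer = sum(math.comb(with_motif, k)
                * math.comb(total - with_motif, family_size - k)
                for k in range(hits, upper + 1))
    return numer / denom


def score(total, with_motif, family_size, hits):
    """-log10 of the P-value; larger means more family-specific."""
    return -math.log10(hypergeom_tail(total, with_motif, family_size, hits))


# Hypothetical numbers: 10 structures, 3 contain the motif, and all 3 of them
# fall inside a family of size 3.
p = hypergeom_tail(total=10, with_motif=3, family_size=3, hits=3)
print(p, score(10, 3, 3, 3))
```

Integer binomials keep the sum exact until the final division, which helps when P is very small.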
13
Proof of Concept: Serine Proteases
38 highly specific motifs mined from serine
proteases classified by SCOP v1.65 (Dec 2003)
[Figure: structures 1HJ9, 1MD8, 1OP0, 1OS8, 1PQ7, 1P57, 1SSX, 1S83]
14
Proof of Concept: Papain-like Cysteine Protease
All the patterns have -log(P) > 49. κ: support in
the PCP family. δ: number of occurrences outside
the family. Patterns that contain the active dyad
(His and Cys) of the proteins are highlighted.
15
Proof of Concept: Papain-like Cysteine Protease
The active site in 1cqd
Choi, K. H., Laursen, R. A., & Allen, K. N.
(1999). The 2.1 angstrom structure of a cysteine
protease with proline specificity from ginger
rhizome, Zingiber officinale. Biochemistry,
38(36), 11624-11633.
16
Proof of Concept: Function Inference of Orphan
Structure
1nfg (SCOP 51556): Metallo-dependent hydrolase (MDH)
  • 8-stranded βα (TIM) barrel fold
  • 17 members, 49 family-specific spatial motifs
1m65 (CASP5 T0147)
  • Unknown function
  • No good sequence or global structure alignment to known proteins
  • 7-stranded barrel fold
  • 30 motifs found
17
Proof of Concept: Function Inference II
1ecs (SCOP 54598): antibiotic resistance (AR) protein
  • Glyoxalase / bleomycin resistance / dioxygenase superfamily
  • 4 members (SCOP 1.65), 62 family-specific spatial motifs
1twu (YycE)
  • Unknown function, not in SCOP 1.67, DALI z < 10 in Nov 2004
  • 46 motifs found
  • Structurally similar to the three new non-redundant AR proteins added in SCOP 1.67
18
References and Acknowledgement
  • Collaborators
      • Catherine Blake (information retrieval)
      • Charlie Carter (biochemistry)
      • Nikolay Dokholyan (biophysics)
      • Leonard McMillan (computer graphics)
      • Jan Prins (high performance computing)
      • Jack Snoeyink (computational geometry)
      • Alexander Tropsha (pharmacy)
  • Partially supported by
      • Microsoft eScience Applications Award
      • Microsoft New Faculty Fellowship
      • NSF CAREER Award IIS-0448392
      • NSF CCF-0523875
      • NSF DMS-0406381
  • Prototype deployed at
  • Comparing graph representations of protein
    structure for mining family-specific
    residue-based packing motifs, Journal of
    Computational Biology (JCB), 2005.
  • SPIN: Mining maximal frequent subgraphs from
    graph databases, Proceedings of the 10th ACM
    SIGKDD International Conference on Knowledge
    Discovery and Data Mining (SIGKDD), pp. 581-586,
    2004.
  • Mining spatial motifs from protein structure
    graphs, Proceedings of the 8th Annual
    International Conference on Research in
    Computational Molecular Biology (RECOMB), pp.
    308-315, 2004.
  • Accurate classification of protein structural
    families using coherent subgraph analysis,
    Proceedings of the Pacific Symposium on
    Biocomputing (PSB), pp. 411-422, 2004.
  • Efficient mining of frequent subgraphs in the
    presence of isomorphism, Proceedings of the 3rd
    IEEE International Conference on Data Mining
    (ICDM), pp. 549-552, 2003.
  • Another 45 papers on general methodology
    development directly related to this project