Title: Learning representations of molecular structure from data
ABSTRACT
Linguistics research has developed ways to represent the hierarchical composition of text and speech. These representations are grounded in the sequential nature of objects in the space of strings and utterances. In recent years, progress has been made on learning such models from data. We study the possibility of extending these methods to automatically extract hierarchical representations of molecular structure. Such methods would facilitate similarity queries, with applications to automated drug design.

[Figure: Chomsky hierarchy]

SELECTED REFERENCES
D. J. Cook, L. B. Holder, S. Su, R. Maglothin, and I. Jonyer (2001). Structural Mining of Molecular Biology Data. IEEE Engineering in Medicine and Biology, special issue on Advances in Genomics, 20(4), 67-74.
W. R. Taylor (2002). A periodic table of protein structures. Nature 416, p. 657.
D. B. Searls (2002). The language of genes. Nature 420, 211-217.
T. Przytycka, R. Srinivasan, and G. Rose (2002). Recursive Domains in Proteins. Protein Science 11, 409-417.
J. Engelfriet and J. J. Vereijken (1997). Context-Free Graph Grammars and Concatenation of Graphs. Acta Informatica 34(10), 773-803.
K. Wittenburg and L. Weitzman (1998). Relational Grammars: Theory and Practice in a Visual Language Interface for Process Modeling. Springer-Verlag.
A. Stolcke (1994). Bayesian Learning of Probabilistic Language Models. Ph.D. dissertation, University of California.
The language of genes by Searls
Grammar-style derivations of idealized versions of RNA structures. a, A stem; b, a branched structure; c, a pseudoknot; and d, alternative secondary structures of an attenuator. The trees for a and b are graphical depictions of derivations from grammars given in the text. By convention, a starting nonterminal S is at the root of the tree and gives rise to branches for each symbol to which it rewrites in the course of the derivation. The string derived can be read by tracing the frontier, or leaf nodes, of the tree from left to right (dashed blue lines). For c and d, derivation trees are not explicitly indicated because of the complexity of the context-sensitive grammars required. The same strings are also shown in linear fashion, with dependencies indicated between terminals derived at the same steps.

Protein domain arrangements and the Chomsky hierarchy. Shown are backbone structures for a, cat muscle pyruvate kinase (1pkm in the Protein Data Bank, minus a short amino-terminal domain) and b, Escherichia coli D-maltodextrin binding protein (1omp in the Protein Data Bank). At the bottom are schemas of the domain relationships, with double arrows connecting segments participating in the same domain. The upper, carboxy-terminal (blue) domain of 1pkm attaches by way of a simple concatenation, which is a regular operation commonly seen in proteins. The central red-and-green α/β-barrel, however, is interrupted in the middle by an insertion of the lower (orange) domain, a context-free operation insofar as it creates a strictly nested dependency between the divided domain segments (as would any number of domain insertions at any point). Insertions are less common than concatenations, but still fairly frequent. The two main domains of 1omp, on the other hand, seem to be interleaved, thus creating cross-serial dependencies that are necessarily context-sensitive. Whether the C-terminal (blue) segment is involved fully in the lower domain's core, however, is open to question; in any case, true interleaved structural domains seem to be very rare. The dashed ellipses in the backbone diagrams illustrate that the number of crossovers between domains (1, 2 and 3, respectively) is indicative of the level in the Chomsky hierarchy of the resulting domain arrangement.
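
The link between dependency patterns and grammar classes described in the captions above can be sketched in a few lines of Python. This is only an illustrative heuristic, not a method from Searls's paper: the chain is written as a string of hypothetical domain labels (one letter per segment), segments of the same domain are linked by arcs, and the arrangement is labeled by whether those arcs are absent, nested, or crossing. Strictly speaking, Chomsky levels apply to whole languages rather than to single strings, so the labels below describe the kind of operation that would be needed, nothing more.

```python
from itertools import groupby


def domain_arcs(segments):
    """Link consecutive chain segments that belong to the same domain."""
    # Merge adjacent segments with the same label: "AAB" behaves like "AB".
    labels = [label for label, _ in groupby(segments)]
    last_seen = {}
    arcs = []
    for pos, label in enumerate(labels):
        if label in last_seen:
            arcs.append((last_seen[label], pos))
        last_seen[label] = pos
    return labels, arcs


def has_crossing(arcs):
    """True if any two arcs (i, j) and (k, l) satisfy i < k < j < l."""
    return any(i < k < j < l for i, j in arcs for k, l in arcs)


def classify_arrangement(segments):
    """Return (number of crossovers, heuristic grammar level)."""
    labels, arcs = domain_arcs(segments)
    crossovers = len(labels) - 1  # label changes along the chain
    if not arcs:
        level = "regular (concatenation)"
    elif not has_crossing(arcs):
        level = "context-free (nested insertion)"
    else:
        level = "context-sensitive (interleaved)"
    return crossovers, level


# Toy arrangements: two concatenated domains, one domain inserted
# into another, and two interleaved domains.
for name, segments in [("concatenation", "AB"),
                       ("insertion", "ABA"),
                       ("interleaving", "ABAB")]:
    print(name, classify_arrangement(segments))
```

On the three toy strings the crossover counts come out as 1, 2 and 3, in line with the counts quoted in the caption for concatenated, inserted and interleaved domains.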
Existing approaches
- start from intermediate artificial elements (α-helix, amino acid, loop)
- do not model structure and function in the same way
- enforce production rules

RESEARCH QUESTIONS
We are looking for a grammar-like representation that would generate only plausible structures, e.g. only plausible protein folds or only viable small molecules.
- Could we learn a hierarchical representation of molecular structures that uses the same set of rules from small molecules to RNA to proteins?
- Would statistically inferred grammars find the same words in the molecular lexicon as were established by experts (e.g. nucleotides and amino acids)? What else would we find in the lexicon?
- Which information is essential for unsupervised acquisition of molecular structure?
- What classes of molecules would correspond to what grammars in the hierarchy?
- Could we constrain the class of grammars so as to enforce similarity among the resulting representations in one sense or another?
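
To make the question about statistically inferred "words" concrete, here is a minimal sketch of unsupervised lexicon discovery on symbol strings. It uses greedy merging of the most frequent adjacent pair (in the spirit of byte-pair encoding), which is a stand-in chosen for brevity and not the grammar induction method proposed on this poster. The function names and the toy input strings are hypothetical.

```python
from collections import Counter


def most_frequent_pair(sequences):
    """Return the most frequent adjacent token pair and its count."""
    counts = Counter()
    for seq in sequences:
        counts.update(zip(seq, seq[1:]))
    if not counts:
        return None, 0
    return counts.most_common(1)[0]


def merge_pair(seq, pair):
    """Replace each non-overlapping occurrence of `pair` by one token."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged


def induce_lexicon(sequences, min_count=2, max_rules=10):
    """Greedily build a lexicon of recurring units from raw strings."""
    sequences = [list(s) for s in sequences]
    lexicon = []
    for _ in range(max_rules):
        pair, count = most_frequent_pair(sequences)
        if pair is None or count < min_count:
            break
        lexicon.append(pair[0] + pair[1])
        sequences = [merge_pair(s, pair) for s in sequences]
    return lexicon, sequences


# Toy input (invented): the unit "abc" recurs and should be discovered.
lexicon, parsed = induce_lexicon(["abcabc", "abcd"])
print(lexicon)
print(parsed)
```

Each merge acts like a rewrite rule that introduces a new lexicon entry, so the sequence of merges is a crude hierarchical description of the input. Whether units found this way on real molecular data would coincide with nucleotides or amino acids is exactly the open question posed above.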
A periodic table of protein structures by Taylor
Stick-figure representations of the basic Forms. Each of the basic generating Forms is represented by 'stick' models in which α-helices are red and drawn thicker than the green β-strands. a, αβα-layers. Six strands are shown, but the sheet can extend indefinitely. b, αββα-layers. As in a, the sheets can be extended. (Removal of the α-layers leaves the common β-'sandwich'.) c, Eight-fold αβ barrel. Similar barrels with 5–9 strands were constructed. (See Supplementary Information A.1 for construction details.) By deleting helices and strands from these models, almost all known globular protein domains of and types can be generated.

Simplified layer structure of proteins. Layers of secondary structure (β, green; α, red) are combined to make globular protein domains. The β-sheets are represented as bars and circles, as they would appear when viewed looking along their component strands. Each sheet has a left-handed twist between the strands (not depicted) onto which can be added curl and stagger. This allows the sheets to progressively 'deform' from a topologically flat sheet into a cylinder (or barrel). The two endpoints and one intermediate stage are represented by the rows in the figure and indexed as I (flat), C (curled) and O (barrel). For each of these, up to four layers of secondary structure are shown (α-helices are drawn as red dots viewed end-on). For simplicity, not all possible layer combinations have been represented; in particular, those with adjacent α-layers have been omitted because the boundary between these is not well defined. Most biologically important structures can be generated from three of those represented above; this set of three is referred to as the basis set. (See Fig. 1 for three-dimensional 'stick' figures of the basis set.) Using the I, C, O index plus layer number: I31 can generate I21 and I11 (by the deletion of helices); O21 can generate O11 and, with the removal of strands from the barrel, also C21 and C11. Similarly, I42 can produce an αββ and the common ββ layer structures.
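
The generation of simpler layer structures by deletion can be sketched as a small enumeration. The sketch below assumes that each form is written as a string of layers (for example "αβα" for I31 and "αββα" for I42, which is an interpretation of the index rather than something stated in the caption) and that deleting helices means removing any subset of α-layers. Mirror-image strings such as "αβ" and "βα" are treated as the same architecture. The function name is hypothetical, and strand removal (O21 to C21) is not modeled.

```python
from itertools import combinations


def reachable_by_helix_deletion(layers):
    """Enumerate layer architectures obtained by deleting alpha-layers.

    `layers` is a string over {"α", "β"}, read across the stack of layers.
    Each result is canonicalized so that a string and its reverse count
    as the same architecture.
    """
    alpha_positions = [i for i, c in enumerate(layers) if c == "α"]
    results = set()
    for k in range(len(alpha_positions) + 1):
        for drop in combinations(alpha_positions, k):
            kept = "".join(c for i, c in enumerate(layers) if i not in drop)
            results.add(min(kept, kept[::-1]))
    return results


# Basis forms written as layer strings (assumed encoding, see above).
for form in ["αβα", "αββα"]:
    print(form, "->", sorted(reachable_by_helix_deletion(form)))
```

Under this encoding, "αβα" yields the two-layer αβ and single-layer β architectures, and "αββα" yields αββ and ββ, matching the derivations listed in the caption.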
SUBDUE (graph grammar induction) by Holder and Cook
Let's talk!
Please seek me out or call my cell if you'd like to discuss my poster or provide any feedback.
Dr. Leon Peshkin, M. Dworkin room 134, Harvard University, 33 Oxford St, Cambridge, MA 02138, USA; cell 617-699-7147