Title: Obol: Open Bio-Ontology Language
1 Obol: Open Bio-Ontology Language
- Using grammars to extract and use implicit knowledge in the GO and OBO
- Chris Mungall
- Berkeley Drosophila Genome Project / GO Consortium
2 Obol
- Obol is a system for discovering and reasoning over hidden knowledge in ontologies
- Obol is useful for helping maintain cross-products in the Gene Ontology
- Obol works by parsing syntax and semantics from GO and OBO terms
3 Motivation: Ontology Maintenance
- GO: 3 ontologies, 16k terms, 23k relationships
- OBO: cell, biochemical, sequence and multiple anatomical ontologies
- Many GO terms are combinatorial (cross-products)
- e.g. regulation of neutrophil differentiation
- No explicit links between ontologies
- Difficult to maintain manually
4 Some Sample GO terms
- regulation of neutrophil differentiation
- neutrophil differentiation
- granulocyte differentiation
- smooth muscle contraction
- nucleolar chromatin
- nucleolus
- oxygen transport
- negative regulation of interleukin-2 biosynthesis
- oxidoreductase activity, acting on paired donors, with incorporation or reduction of molecular oxygen, reduced iron-sulfur protein as one donor, and incorporation of one atom of oxygen
5 Graph complexity
[Figure: graph with is-a and part-of edges over nine terms: biosynthesis, cytokine biosynthesis and interleukin-2 biosynthesis, plus the "regulation of" and "negative regulation of" form of each]
6 Automatic inference of relationships
- Some relationships can be derived computationally
- provided we have complete logical definitions
regulation(regtype=negative)(regprocess=biosynthesis(makes=interleukin-2))
Tools exist for reasoning over these logical definitions, but...
7 Generating logical definitions
- Generating and maintaining logical definitions for GO/OBO is non-trivial
- Obol exploits the highly regular grammatical structure of GO term names
- regulation of X, never X regulation
- Y biosynthesis, never biosynthesis of Y
- no stemming required
- Obol derives candidate class definitions from term names, and performs basic reasoning over them
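The regularity described above can be illustrated with a minimal Python sketch. It is not Obol's grammar (Obol is implemented in XSB Prolog); the two regular expressions, the function name `decompose` and the dictionary layout are assumptions made for illustration, covering only the "regulation of X" and "Y biosynthesis" patterns.

```python
import re

# Hypothetical patterns for the two regularities named on the slide.
# "positive" is assumed as the counterpart of "negative".
REGULATION = re.compile(r"(negative |positive )?regulation of (.+)")
BIOSYNTHESIS = re.compile(r"(.+) biosynthesis")


def decompose(name):
    """Return a nested dict for a term name; unrecognised names become a bare stem."""
    m = REGULATION.fullmatch(name)
    if m:
        result = {"stem": "regulation", "regprocess": decompose(m.group(2))}
        if m.group(1):
            result["regtype"] = m.group(1).strip()
        return result
    m = BIOSYNTHESIS.fullmatch(name)
    if m:
        return {"stem": "biosynthesis", "makes": decompose(m.group(1))}
    return {"stem": name}


print(decompose("negative regulation of interleukin-2 biosynthesis"))
```

Because the word order is fixed ("regulation of X", never "X regulation"), plain pattern matching already recovers the structure for these two cases. A full grammar is needed once terms combine adjectives, noun compounds and prepositions more freely.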
8 Obol parsing and reasoning
- GO/OBO term (lexical string): interleukin-2 biosynthesis
- Class definition(s), which may involve relationships to other OBO terms: biosynthesis(makes=interleukin-2)
- Inferences using definitions and existing ontologies: interleukin-2 biosynthesis is_a cytokine biosynthesis, inferred from interleukin-2 is_a cytokine
9 How Obol Works
- term names are broken into lexical tokens (words) using a tokeniser
- tokens are parsed using a grammar, generating parse trees
- parse trees are turned into class definitions using transformation rules and property definitions
- transformation is reversible
- class definitions are reasoned over
- implemented in XSB Prolog
10 Word tokens
- Obol uses an atomic vocabulary of word tokens
- tokens are partitioned by ontology domain
- cell, anatomy, biological process, etc
- tokens have a grammatical type
- adj, noun, prep, relational adj, special
- vocabularies need not be correct or complete
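A word-token vocabulary of this kind could be sketched in Python as follows. The entries, the `Token` type and the choice to tag unknown words as `"unknown"` rather than reject them are assumptions for illustration; the slide does not say how Obol handles words missing from its vocabulary.

```python
from typing import NamedTuple


class Token(NamedTuple):
    word: str
    gtype: str   # grammatical type: adj, noun, prep, ...
    domain: str  # ontology domain the word belongs to


# Hypothetical vocabulary entries: word -> (grammatical type, domain).
VOCAB = {
    "negative": ("adj", "general"),
    "regulation": ("noun", "biological_process"),
    "of": ("prep", "general"),
    "biosynthesis": ("noun", "biological_process"),
    "neutrophil": ("noun", "cell"),
    "differentiation": ("noun", "biological_process"),
}


def tokenise(name):
    """Split a term name on whitespace and tag each word from the vocabulary.

    Words that are not in the vocabulary are kept and tagged 'unknown',
    so an incomplete vocabulary does not stop tokenisation.
    """
    return [Token(w, *VOCAB.get(w, ("unknown", "unknown"))) for w in name.split()]


for token in tokenise("regulation of neutrophil differentiation"):
    print(token)
```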
11 Computational Grammars
- formal grammars can elucidate sentence structure
- grammars transform token lists into parse trees
- multiple parses may be possible
- parses are reversible
- a grammar is a collection of transformation rules
12 A simple OBO term grammar
(subset of the whole OBO grammar)
Term --> NP       e.g. negative regulation of interleukin-2 biosynthesis
NP --> NP PP      e.g. negative regulation of interleukin-2 biosynthesis
NP --> NOUN       e.g. interleukin, regulation, biosynthesis
NP --> NP-TOK     e.g. interleukin-2
NP --> ADJ NP     e.g. negative regulation
NP --> NP NP      e.g. interleukin-2 biosynthesis
PP --> PREP NP    e.g. of interleukin-2 biosynthesis
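The grammar subset above can be tried out with a small memoised span parser in Python. This is a sketch only: Obol uses tabled DCGs in XSB Prolog, and the `LEXICON` entries and the nested-tuple tree format here are assumptions. Memoising on (symbol, start, end) and always splitting a span into two strictly smaller parts is what lets the left-recursive rules NP --> NP PP and NP --> NP NP terminate.

```python
from functools import lru_cache

# Hypothetical lexicon: word -> preterminal category.
LEXICON = {
    "negative": "ADJ",
    "regulation": "NOUN",
    "of": "PREP",
    "interleukin-2": "NP-TOK",
    "biosynthesis": "NOUN",
}

# The grammar subset from the slide.
RULES = {
    "Term": [("NP",)],
    "NP": [("NP", "PP"), ("NOUN",), ("NP-TOK",), ("ADJ", "NP"), ("NP", "NP")],
    "PP": [("PREP", "NP")],
}


def parse(name):
    """Return every parse tree of the term name as nested tuples."""
    words = tuple(name.split())

    @lru_cache(maxsize=None)
    def span(sym, i, j):
        if sym not in RULES:  # preterminal: must match exactly one word
            if j - i == 1 and LEXICON.get(words[i]) == sym:
                return ((sym, words[i]),)
            return ()
        trees = []
        for rhs in RULES[sym]:
            if len(rhs) == 1:  # unary rule over the same span
                trees += [(sym, t) for t in span(rhs[0], i, j)]
            else:  # binary rule: try every split point
                for k in range(i + 1, j):
                    for left in span(rhs[0], i, k):
                        for right in span(rhs[1], k, j):
                            trees.append((sym, left, right))
        return tuple(trees)

    return list(span("Term", 0, len(words)))


trees = parse("negative regulation of interleukin-2 biosynthesis")
print(len(trees), "parse(s)")
```

With this unconstrained subset the example term has more than one parse, because the adjective and the prepositional phrase can attach at different levels. This is the kind of ambiguity that precedence rules are meant to prune.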
13 Applying grammar rules
[Figure: parse tree for "negative regulation of interleukin-2 biosynthesis". The words are tagged adj, noun, prep, tok, noun, and the tree is built by applying the rules term -> np, np -> np pp, pp -> p np, np -> np np, np -> adj np, np -> np-tok and np -> n]
14 Generating Class Definitions
- A parse tree shows the syntactic structure of a term
- A class definition is a description of the meaning of a term
- An Obol classdef is a cross product (intersection) of necessary and sufficient conditions
- Classdefs are generated from parse trees using tree transform rules and property descriptions
- Classdefs can be exported using obo or OWL format
15 Property definitions guide class construction
Property: name makes; domain biosynthesis; range substance; grammar np_modifier
[Figure: parse tree np -> np np over "interleukin-2 biosynthesis", yielding the class definition biosynthesis(makes=interleukin-2)]
16 Property definitions guide class construction
Property: name regtype; domain regulation; range neg/pos; grammar np_modifier
[Figure: parse tree np -> adj np over "negative regulation", yielding the class definition regulation(regtype=negative)]
17 Property definitions guide class construction
Property: name regprocess; domain regulation; range biological_process; grammar prep(of)
[Figure: parse tree np -> np pp, where the pp is "of" followed by an np. The left np gives regulation(regtype=negative), the np inside the pp gives biosynthesis(makes=IL-2), and the result is regulation(regtype=negative)(regprocess=biosynthesis(makes=IL-2))]
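The three property definitions shown on these slides (makes, regtype, regprocess) can drive a tree-to-classdef transform. The Python sketch below is an illustration under stated assumptions, not Obol's transformation rules: the simplified tree encoding (`"mod"` for modifier plus head, `"pp"` for head plus preposition plus object), the function names and the tuple-based classdef are all hypothetical, and range checking is left out.

```python
# Property definitions taken from the slides (range omitted in this sketch).
PROPERTIES = [
    {"name": "makes", "domain": "biosynthesis", "grammar": "np_modifier"},
    {"name": "regtype", "domain": "regulation", "grammar": "np_modifier"},
    {"name": "regprocess", "domain": "regulation", "grammar": "prep(of)"},
]


def find_property(stem, grammar):
    """Pick the property whose domain and grammatical context both match."""
    for prop in PROPERTIES:
        if prop["domain"] == stem and prop["grammar"] == grammar:
            return prop["name"]
    raise ValueError(f"no property for stem {stem!r} in context {grammar!r}")


def classdef(tree):
    """Turn a simplified parse tree into (stem, ((property, filler), ...))."""
    if isinstance(tree, str):  # a single word is its own stem
        return (tree, ())
    if tree[0] == "mod":  # ("mod", modifier, head)
        _, modifier, head = tree
        grammar = "np_modifier"
    else:  # ("pp", head, preposition, object)
        _, head, prep, modifier = tree
        grammar = f"prep({prep})"
    stem, props = classdef(head)
    name = find_property(stem, grammar)
    return (stem, props + ((name, classdef(modifier)),))


def show(cd):
    """Render a classdef in the stem(property=filler) notation of the slides."""
    stem, props = cd
    return stem + "".join(f"({name}={show(value)})" for name, value in props)


tree = ("pp", ("mod", "negative", "regulation"), "of",
        ("mod", "interleukin-2", "biosynthesis"))
print(show(classdef(tree)))
```

The property's domain selects on the stem of the head, and its grammar field selects on how the filler is attached, so the same modifier position can yield `makes` under biosynthesis and `regtype` under regulation.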
18 Unparseable terms and multi-parse terms
[Figure: unparseable and multi-parse terms for biological process, molecular function and cellular component; single-token terms excluded from this analysis]
19 Reasoning over class definitions
- Using class definitions, we can
- autocreate parentage for new terms
- check for missing relationships
- find inconsistencies between ontologies
- generate implicit orthogonal ontologies
- Method
- Use native OBOL rules (via Prolog or DAG-Edit)
- OR use an external reasoner, e.g. RACER, FaCT
20 Finding missing relationships
- Obol is run periodically on GO to check for missing IS A and PART OF relationships
- Multiple parses produce false positives
- 223 missing relationships added to GO
- To do: increase specificity by improving vocabularies and property definitions
21 Obol sample report
nucleolar chromatin PART OF nucleus
clathrin-coated vesicle HAS PART clathrin coat
chromoplast membrane IS A plastid membrane
nuclear microtubule PART OF nucleus
vitamin E biosynthesis IS A vitamin E metabolism
uracil permease activity IS A permease activity
chloroplast envelope IS A plastid envelope
negative regulation of lipid biosynthesis IS A negative regulation of lipid metabolism
ketone body metabolism IS A ketone metabolism
dense nuclear body IS A nuclear body
(callouts on the slide: "inverse present", "false positive!")
22 Aligning to the OBO cell ontology
Most differentiation terms align precisely; some don't.
[Figure: GO terms cardiac cell differentiation and cardioblast differentiation aligned against the cell ontology terms animal cell, mesodermal cell, muscle cell, cardiac muscle cell and cardioblast, with a DEVELOPS FROM edge; one alignment is marked "???"]
23 Deriving existing GO relationships
24 Obol as an ontology curation tool
- Obol can be used by GO curators in a variety of ways
- Behind the scenes
- Iterative
- GO curator receives periodic suggestion reports
- Continuous
- GO curator uses OBOL interactively via DAG-Edit plugin
- To help the transition to a fully specified ontology
- GO curators then maintain class definitions
- Obol as a search tool?
25 Problems to address
- Integration with curation process
- Memory usage
- Syntax parsing
- chemical terms, long terms
- Dealing with and, or and not
- Generating text definitions
- Word list maintenance
- solution: integrate with ontology maintenance
- Ontology dependencies
- protein and generic anatomy ontologies needed
- Obol can be used to help generate these
26 Conclusions
- Obol is useful for maintaining large GO-style ontologies
- combination of semantic parsing with reasoning is powerful
- benefits of both GO-style ontology development and formal reasoning
27 Acknowledgements
- Berkeley/GO
- John Richter
- Brad Marshall
- Karen Eilbeck
- Suzanna Lewis
- Gerry Rubin
- Jackson Labs/GO
- David Hill
- Joel Richardson
- Judith Blake
- GO Curators
- Midori Harris
- Jennifer Clark
- Amelia Ireland
- Jane Lomax
- Manchester
- Chris Wroe
- Robert Stevens
- Phillip Lord
- J Michael Cherry
- Michael Ashburner
- all the GO Consortium
28 The OBO Universe (partial)
[Figure: partial map of OBO ontologies: Chemical, Phenotype, Function, Process, Protein, Component, Cell, Sequence, Anatomy, Fly-Anat, Fish-Anat]
29 Detecting inconsistencies
[Figure: hierarchy of terms X, Y, Z shown next to the hierarchy of X binding, Y binding, Z binding]
30 Prolog as an ontology language
- DATABASE OF FACTS
- isa(carb_binding, binding).
- isa(polysac_binding, carb_binding).
- isa(chitin_binding, polysac_binding).
- isa(cellulose_binding, polysac_binding).
- INFERENCE RULES
- isaT(X,Y) :- isa(X,Y).
- isaT(X,Y) :- isa(X,Z), isaT(Z,Y).
?- isaT(chitin_binding, binding).
YES
?- isaT(X, polysac_binding).
X = chitin_binding.
X = cellulose_binding.
?- isaT(chitin_binding, cellulose_binding).
NO
?- isaT(X,Y).
returns all paths
31 Prolog internal representation
class(regulation <process>
  qualifier=class(negative <general>)
  regulates=class(contraction <process>
    affects_cell_type=class(muscle <anatomical>
      has_type=class(smooth <general>))))
[Figure: the same class definition drawn as a graph over regulation, contraction, muscle, smooth and negative]
32 Prolog Grammar Implementation
- Prolog: the classic logic programming language
- High-level declarative language, natural choice for ontologies
- built-in database
- Definite Clause Grammars (DCGs): part of the language
- DCGs allow passing data up the parse tree
- XSB Prolog
- Uses tabling (more efficient, less re-calculation)
- Tabling + DCGs = chart parsing (Earley's algorithm)
- This means we can have left-recursive grammars
33 A Formal Grammar for OBO terms
- All(?) GO/OBO terms are NOUN-PHRASES (exception: phenotypes?)
- A NOUN-PHRASE is (recursively) made from
- a NOUN (includes inflected verbs, e.g. binding)
- an ADJECTIVE followed by a NOUN-PHRASE, e.g. inner membrane
- a NOUN-PHRASE preceded by a NOUN-PHRASE acting as ADJECTIVE, e.g. clathrin coat
- a NOUN-PHRASE then PREPOSITION then NOUN-PHRASE, e.g. regulation of transcription
- an (optional) NOUN-PHRASE then a RELATIONAL ADJECTIVE then a NOUN-PHRASE, e.g. clathrin-coated vesicle
- Precedence rules are also required to prune the parse forest
- Simple but effective
34
Sentence -> Subject Verb Object
Subject -> Article Noun
Object -> Article Noun
Article -> a | the
Verb -> ate | chased
Noun -> cat | banana | mouse
Parse tree for a simple sentence: the cat ate the banana
GENERATING: Start at top ('sentence') and apply rules until all symbols are terminal
PARSING: Start at bottom with a sequence of terminals; apply rules, combining symbols if necessary
- A formal grammar is a set of production rules operating over terminal symbols (e.g. words) and non-terminal symbols (e.g. word/phrase categories)
- The rules determine how sequences of symbols can be transformed, making a parse tree
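The GENERATING direction can be made concrete with a short Python sketch that expands the toy grammar from the top symbol until only terminals remain. The dictionary encoding and the function name `generate` are assumptions for illustration.

```python
from itertools import product

# The toy grammar from the slide; any symbol without an entry is a terminal.
GRAMMAR = {
    "Sentence": [["Subject", "Verb", "Object"]],
    "Subject": [["Article", "Noun"]],
    "Object": [["Article", "Noun"]],
    "Article": [["a"], ["the"]],
    "Verb": [["ate"], ["chased"]],
    "Noun": [["cat"], ["banana"], ["mouse"]],
}


def generate(symbol):
    """Return every word sequence derivable from the given symbol."""
    if symbol not in GRAMMAR:  # terminal symbol: nothing left to expand
        return [[symbol]]
    results = []
    for rhs in GRAMMAR[symbol]:
        # Expand each right-hand-side symbol and combine the alternatives.
        for parts in product(*(generate(s) for s in rhs)):
            results.append([word for part in parts for word in part])
    return results


sentences = [" ".join(words) for words in generate("Sentence")]
print(len(sentences))
print("the cat ate the banana" in sentences)
```

This works only because the toy grammar is not recursive, so the set of sentences is finite. The OBO term grammar has recursive noun-phrase rules, so exhaustive generation would not terminate there without a depth bound.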
35
[Figure: parse tree and class definition for "negative regulation of smooth muscle contraction"]
Parse tree: the words are tagged ADJ NOUN P ADJ NOUN NOUN and combined with the rules NP -> NP PP, PP -> P NP, NP -> NP NP and NP -> ADJ NOUN. Thick lines indicate stem terms; the parse tree shows the syntactic structure.
Recurse down the tree applying grammatical context rules to get property fillers.
Class definition (shown as DAG): built from regulation, contraction, muscle, smooth and negative, with intersection (X) nodes for negative regulation of smooth muscle contraction, smooth muscle contraction and smooth muscle. The classdef shows the logical structure.
We need new OBO format (or OWL) to represent cross-products (aka intersections / complete defs).
DAG definition is minimal: non-definitional relationships to other terms not shown.
36 COPII-coated vesicle membrane
class(membrane <component> "COPII-coated vesicle membrane"
  part_of=class(vesicle <component> "COPII-coated vesicle"
    has_part=class(coat <component> "COPII coat"
      made_from=class(COPII <complex> "COPII"))))
class/term name shown in quotes; these can be derived by reversing the transformation
the above classdef is consistent with what is in the GO cellular_component ontology
requires use of inverse properties (has_part vs part_of), supported in new OBO format.
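Deriving the quoted names by reversing the transformation might look like the following Python sketch. The rendering rules are assumptions for illustration, not Obol's reverse rules: `part_of` and `made_from` fillers are written as a noun-phrase modifier in front of the stem, and a `has_part` filler is written as a relational adjective taken from a hypothetical `RELATIONAL_ADJ` table.

```python
# Hypothetical table mapping a part's stem to its relational adjective.
RELATIONAL_ADJ = {"coat": "coated"}


def name(cd):
    """Rebuild a term name from a classdef of the form (stem, [(property, filler), ...])."""
    stem, props = cd
    text = stem
    for prop, filler in props:
        if prop in ("part_of", "made_from"):
            # Noun-phrase modifier: the filler's name precedes the stem.
            text = f"{name(filler)} {text}"
        elif prop == "has_part":
            # Relational adjective, optionally prefixed by what the part is made from.
            part_stem, part_props = filler
            adjective = RELATIONAL_ADJ[part_stem]
            material = [name(v) for p, v in part_props if p == "made_from"]
            prefix = f"{material[0]}-{adjective}" if material else adjective
            text = f"{prefix} {text}"
        else:
            raise ValueError(f"no rendering rule for property {prop!r}")
    return text


copii = ("COPII", [])
coat = ("coat", [("made_from", copii)])
vesicle = ("vesicle", [("has_part", coat)])
membrane = ("membrane", [("part_of", vesicle)])

for cd in (coat, vesicle, membrane):
    print(name(cd))
```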
37 Outline
- Motivation: combinatorial issues with composite terms
- Approaches: annotation-time term composition vs tools for maintenance of large DAGs
- The OBOL System
- Term decomposition using grammars
- Generating computable logical class definitions
- Rules and reasoning over class definitions
- Initial Results
- Strategies for using OBOL within GO/OBO
38 Manual maintenance of GO
- GO is 3 DAGs of over 16k terms
- Large DAGs of terms are hard to maintain
- cross-products produce combinatorial explosions and highly connected sub-graphs
- GO terms include OBO terms
- e.g. oxygen binding, wing development
- Zipf's Law
- Many terms not yet used in annotation (Ogren, pers. comm.)
39 Example combinatorial explosion
[Figure: graph distinguishing actual GO terms from implicit terms. Nodes: cell motility, regulation, negative regulation, contraction, muscle contraction, smooth muscle contraction, and the "regulation of" and "negative regulation of" forms of contraction, muscle contraction and smooth muscle contraction. Implicit anatomical terms not shown]
40 One Approach: Properties
- One extreme solution is to remove composite terms from the ontology altogether
- Generate anonymous composite terms at annotation-time via property/slot restrictions to atomic terms
- binding affects(interleukin-18)
- contraction affects(muscle type(smooth))
- These bindings constitute a class definition
- But it is still necessary to make statements about composite terms in the ontology
- macrophage activation is_a immune cell activity
- fibrinolysis is_a negative regulation of blood coagulation
41 Another approach: Computationally aided ontology maintenance
- GO terms exhibit regularity in their syntactic structure
- Substring relationships highly correlated with actual relationships
- regulation of smooth muscle contraction
- smooth muscle contraction
- muscle contraction
- contraction
Ogren PV, Cohen KB, Acquaah-Mensah GK, Eberlein J, Hunter L. 2004. The compositional structure of Gene Ontology terms. Pac Symp Biocomput 9: 214-215.
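The substring observation can be checked mechanically. The Python sketch below lists every pair of terms where one name occurs as a contiguous run of whole words inside another; the function names are assumptions for illustration. It only proposes candidate pairs and does not say which relationship (is_a, part_of, or none) holds between them.

```python
def contains(longer, shorter):
    """True if the words of `shorter` occur contiguously inside `longer`."""
    a, b = longer.split(), shorter.split()
    if len(b) >= len(a):
        return False
    return any(a[i:i + len(b)] == b for i in range(len(a) - len(b) + 1))


def candidate_pairs(terms):
    """Return (longer, shorter) pairs related by whole-word substring containment."""
    return [(x, y) for x in terms for y in terms if contains(x, y)]


terms = [
    "regulation of smooth muscle contraction",
    "smooth muscle contraction",
    "muscle contraction",
    "contraction",
]

for longer, shorter in candidate_pairs(terms):
    print(f"{longer!r} contains {shorter!r}")
```

Matching on whole words rather than raw characters avoids spurious hits such as "contraction" inside an unrelated longer word.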
42 OBOL: Syntax and Semantics
- What about using the term syntax to get at the meaning of the term?
- GO/OBO term: interleukin binding
- Class definition, which may involve relationships to other OBO terms: binding affects(interleukin)
- Inferences: interleukin binding is_a cytokine binding, inferred from interleukin is_a cytokine
43 Inference of intermediate terms and IS_As
example rule:
FORALL classdef pairs, IFF the stem-class is the same AND all the property-values in the restriction-list are identical EXCEPT for one property, in which the property-values are linked by an isa, THEN the classdefs are linked by an isa
class(regulation process_regulated=R qual=Q) is_a class(regulation process_regulated=R' qual=Q) <=> R is_a R'
class(C P1=V1 P2=V2 .. Px=Vx .. Pn=Vn) is_a class(C P1=V1 P2=V2 .. Px=Vx' .. Pn=Vn) <=> Vx is_a Vx'
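The example rule can be written out as a small Python sketch. It is an illustration under stated assumptions rather than Obol's Prolog rule: classdefs are encoded as `(stem, {property: value})` tuples, values are either atomic term names or nested classdefs, the asserted is_a links in `ASSERTED_ISA` are assumed to be acyclic, and the property names follow the earlier slides.

```python
# Asserted is_a links between atomic terms (assumed acyclic).
ASSERTED_ISA = {("interleukin-2", "cytokine")}


def atomic_is_a(x, y):
    """Transitive closure over the asserted is_a links."""
    return any(a == x and (b == y or atomic_is_a(b, y)) for a, b in ASSERTED_ISA)


def is_a(x, y):
    """Apply the rule: same stem, identical properties except one linked by is_a."""
    if isinstance(x, str) and isinstance(y, str):
        return atomic_is_a(x, y)
    if isinstance(x, str) or isinstance(y, str):
        return False  # an atomic term is never compared with a classdef here
    (stem_x, props_x), (stem_y, props_y) = x, y
    if stem_x != stem_y or props_x.keys() != props_y.keys():
        return False
    differing = [p for p in props_x if props_x[p] != props_y[p]]
    if len(differing) != 1:
        return False
    p = differing[0]
    return is_a(props_x[p], props_y[p])  # values may be nested classdefs


il2_biosynthesis = ("biosynthesis", {"makes": "interleukin-2"})
cytokine_biosynthesis = ("biosynthesis", {"makes": "cytokine"})
neg_reg_il2 = ("regulation", {"regtype": "negative", "regprocess": il2_biosynthesis})
neg_reg_cytokine = ("regulation", {"regtype": "negative", "regprocess": cytokine_biosynthesis})

print(is_a(il2_biosynthesis, cytokine_biosynthesis))
print(is_a(neg_reg_il2, neg_reg_cytokine))
```

Because the differing value is itself compared with `is_a`, a single asserted link between atomic terms propagates upward through nested definitions, which is how one fact about interleukin-2 yields links between the composite biosynthesis and regulation terms.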