Obol: Open Bio-Ontology Language

1
Obol: Open Bio-Ontology Language
  • Using grammars to extract and use implicit
    knowledge in the GO and OBO
  • Chris Mungall
  • Berkeley Drosophila Genome Project / GO Consortium

2
Obol
  • Obol is a system for discovering and reasoning
    over hidden knowledge in ontologies
  • Obol is useful for helping maintain
    cross-products in the Gene Ontology
  • Obol works by parsing syntax and semantics from
    GO and OBO terms

3
Motivation: Ontology Maintenance
  • GO: 3 ontologies, 16k terms, 23k relationships
  • OBO: cell, biochemical, sequence and multiple
    anatomical ontologies
  • Many GO terms are combinatorial (cross-products)
  • e.g. regulation of neutrophil differentiation
  • No explicit links between ontologies
  • Difficult to maintain manually

4
Some Sample GO terms
  • regulation of neutrophil differentiation
  • neutrophil differentiation
  • granulocyte differentiation
  • smooth muscle contraction
  • nucleolar chromatin
  • nucleolus
  • oxygen transport
  • negative regulation of interleukin-2 biosynthesis
  • oxidoreductase activity, acting on paired donors,
    with incorporation or reduction of molecular oxygen,
    reduced iron-sulfur protein as one donor, and
    incorporation of one atom of oxygen
5
Graph complexity
Figure: graph over the terms below; edges are labelled is-a or part-of.
  • biosynthesis
  • regulation of biosynthesis
  • negative regulation of biosynthesis
  • cytokine biosynthesis
  • regulation of cytokine biosynthesis
  • negative regulation of cytokine biosynthesis
  • interleukin-2 biosynthesis
  • regulation of interleukin-2 biosynthesis
  • negative regulation of interleukin-2 biosynthesis
6
Automatic inference of relationships
  • Some relationships can be derived
    computationally
  • provided we have complete logical definitions

regulation (regtype=negative)
  (regprocess=biosynthesis (makes=interleukin-2))
Tools exist for reasoning over these logical
definitions, but
7
Generating logical definitions
  • Generating and maintaining logical definitions
    for GO/OBO is non-trivial
  • Obol exploits the highly regular grammatical
    structure of GO term names
  • regulation of X, never X regulation
  • Y biosynthesis, never biosynthesis of Y
  • no stemming required
  • Obol derives candidate class definitions from
    term names, and performs basic reasoning over them
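
The regularity described above can be illustrated with a very small sketch. This is not how Obol works internally (Obol uses a grammar, implemented in Prolog); it only shows, in Python, why fixed word order makes term names easy to decompose. The two patterns, the function name `decompose` and the dictionary keys are assumptions made for the illustration.

```python
import re

# Hypothetical patterns for two of the regularities named on the slide:
# "regulation of X" (never "X regulation") and "Y biosynthesis".
REGULATION = re.compile(r"^(?:(negative|positive) )?regulation of (.+)$")
BIOSYNTHESIS = re.compile(r"^(.+) biosynthesis$")


def decompose(name):
    """Return a nested dict for a term name, or the name itself if no pattern matches."""
    m = REGULATION.match(name)
    if m:
        result = {"class": "regulation", "regprocess": decompose(m.group(2))}
        if m.group(1):
            result["regtype"] = m.group(1)
        return result
    m = BIOSYNTHESIS.match(name)
    if m:
        return {"class": "biosynthesis", "makes": m.group(1)}
    return name


print(decompose("negative regulation of interleukin-2 biosynthesis"))
```

Because the word order never varies, no stemming or reordering step is needed before matching; a term that fits neither pattern is simply returned unchanged.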

8
Obol parsing and reasoning
GO/OBO Term (lexical string):
interleukin-2 biosynthesis
Class Definition(s) (may involve relationships to
other OBO terms):
biosynthesis(makes=interleukin-2)
Inferences (using definitions and existing
ontologies):
interleukin-2 biosynthesis is_a cytokine biosynthesis,
inferred from interleukin-2 is_a cytokine
9
How Obol Works
  • term names are broken into lexical tokens (words)
    using a tokeniser
  • tokens are parsed using a grammar, generating
    parse trees
  • parse trees are turned into class definitions
    using transformation rules and property
    definitions
  • transformation is reversible
  • class definitions are reasoned over
  • implemented in XSB Prolog

10
Word tokens
  • Obol uses an atomic vocabulary of word tokens
  • tokens are partitioned by ontology domain
  • cell, anatomy, biological process, etc
  • tokens have a grammatical type
  • adj, noun, prep, relational adj, special
  • vocabularies need not be correct or complete
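
A minimal sketch of such a vocabulary, written in Python for illustration rather than in Obol's Prolog. The entries, the domain labels and the `tokenise` helper are assumptions; only the idea that each token carries an ontology domain and a grammatical type comes from the slide.

```python
# Hypothetical vocabulary: token -> (ontology domain, grammatical type).
VOCAB = {
    "negative": ("general", "adj"),
    "regulation": ("biological_process", "noun"),
    "of": ("general", "prep"),
    "interleukin-2": ("substance", "noun"),
    "biosynthesis": ("biological_process", "noun"),
    "smooth": ("general", "adj"),
    "muscle": ("anatomy", "noun"),
    "contraction": ("biological_process", "noun"),
}


def tokenise(name):
    """Split a term name on whitespace and attach (domain, type) to each word.

    Words missing from the vocabulary are kept with (None, None), reflecting
    the point that the vocabulary need not be complete.
    """
    return [(word, *VOCAB.get(word, (None, None))) for word in name.split()]


for token in tokenise("regulation of smooth muscle contraction"):
    print(token)
```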

11
Computational Grammars
  • formal grammars can elucidate sentence structure
  • grammars transform token lists into parse trees
  • multiple parses may be possible
  • parses are reversible
  • a grammar is a collection of transformation rules

12
A simple OBO term grammar
(subset of the whole OBO grammar)
Term --> NP       e.g. negative regulation of interleukin-2 biosynthesis
NP --> NP PP      e.g. negative regulation of interleukin-2 biosynthesis
NP --> NOUN       e.g. interleukin, regulation, biosynthesis
NP --> NP-TOK     e.g. interleukin-2
NP --> ADJ NP     e.g. negative regulation
NP --> NP NP      e.g. interleukin-2 biosynthesis
PP --> PREP NP    e.g. of interleukin-2 biosynthesis
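
The grammar subset above can be tried out with a small span-based parse counter. This is a Python sketch, not Obol's Prolog implementation: the rule tables are transcribed from the slide, while the lexicon and the function name `count_parses` are assumptions. Memoising on (symbol, start, end) keeps the left-recursive rules (NP --> NP PP, NP --> NP NP) from looping, because every recursive call covers a strictly shorter span.

```python
from functools import lru_cache

# Rules from the slide's grammar subset, split into binary and unary forms.
BINARY = [("NP", "NP", "PP"), ("NP", "ADJ", "NP"),
          ("NP", "NP", "NP"), ("PP", "PREP", "NP")]
UNARY = [("Term", "NP"), ("NP", "NOUN"), ("NP", "NP-TOK")]

# Illustrative lexicon (an assumption, not Obol's actual vocabulary).
LEXICON = {"negative": "ADJ", "regulation": "NOUN", "of": "PREP",
           "interleukin-2": "NP-TOK", "biosynthesis": "NOUN"}


def count_parses(name, goal="Term"):
    """Count the distinct parse trees deriving `goal` over the whole term name."""
    tokens = tuple(name.split())

    @lru_cache(maxsize=None)
    def count(symbol, i, j):
        total = 0
        # A single token matches its lexical category directly.
        if j - i == 1 and LEXICON.get(tokens[i]) == symbol:
            total += 1
        for parent, child in UNARY:
            if parent == symbol:
                total += count(child, i, j)
        for parent, left, right in BINARY:
            if parent == symbol:
                for k in range(i + 1, j):  # every split point inside the span
                    total += count(left, i, k) * count(right, k, j)
        return total

    return count(goal, 0, len(tokens))


print(count_parses("interleukin-2 biosynthesis"))                         # 1
print(count_parses("negative regulation of interleukin-2 biosynthesis"))  # 5
```

With these seven rules alone, the five-word example has five parses, which illustrates why extra precedence rules or vocabulary constraints are useful for pruning.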
13
Applying grammar rules
Figure: parse tree for "negative regulation of interleukin-2 biosynthesis"
Rules applied in the tree: term -> np, np -> np pp, pp -> p np,
np -> adj np, np -> np np, np -> np-tok, np -> n
Leaf labels: adj, noun, prep, tok
14
Generating Class Definitions
  • A parse tree shows the syntactic structure of a term
  • A class definition is a description of the
    meaning of a term
  • An Obol classdef is a cross product
    (intersection) of necessary and sufficient
    conditions
  • Classdefs are generated from parse trees using
    tree transform rules and property descriptions
  • Classdefs can be exported in OBO or OWL format

15
Property definitions guide class construction
Property: name=makes, domain=biosynthesis,
range=substance, grammar=np_modifier
Parse tree: np -> np (interleukin-2) np (biosynthesis)
Result: biosynthesis(makes=interleukin-2)
16
Property definitions guide class construction
Property: name=regtype, domain=regulation,
range=neg/pos, grammar=np_modifier
Parse tree: np -> adj (negative) np (regulation)
Result: regulation(regtype=negative)
17
Property definitions guide class construction
Property: name=regprocess, domain=regulation,
range=biological_process, grammar=prep(of)
Parse tree: np -> np pp, where pp -> of np
Left np: regulation (regtype=negative)
Right np: biosynthesis (makes=IL-2)
Result: regulation (regtype=negative) (regprocess=biosynthesis(makes=IL-2))
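
The three property definitions above can be put together in a short sketch that walks a parse tree and fills in properties. It is written in Python for illustration; Obol itself is implemented in Prolog, and nothing here is its actual code. The three property records are transcribed from the slides. The tuple encoding of the parse tree, the `TYPES` lookup, exact-match domain checking, and the names `build` and `render` are all assumptions.

```python
# Property definitions transcribed from the slides.
PROPERTIES = [
    {"name": "makes", "domain": "biosynthesis",
     "range": "substance", "grammar": "np_modifier"},
    {"name": "regtype", "domain": "regulation",
     "range": "neg/pos", "grammar": "np_modifier"},
    {"name": "regprocess", "domain": "regulation",
     "range": "biological_process", "grammar": "prep(of)"},
]

# Invented lookup giving the type of each stem; a real system would consult ontologies.
TYPES = {"interleukin-2": "substance", "negative": "neg/pos",
         "biosynthesis": "biological_process", "regulation": "biological_process"}


def build(tree):
    """Turn a hand-built parse tree into a classdef: (stem, {property: filler})."""
    if isinstance(tree, str):
        return (tree, {})
    if tree[0] == "mod":       # ("mod", modifier, head): ADJ NP or NP NP
        _, modifier, head = tree
        grammar, filler, base = "np_modifier", build(modifier), build(head)
    elif tree[0] == "prep":    # ("prep", head, preposition, object): NP PP
        _, head, prep, obj = tree
        grammar, filler, base = f"prep({prep})", build(obj), build(head)
    else:
        raise ValueError(f"unknown node kind: {tree[0]!r}")
    stem, props = base
    for prop in PROPERTIES:
        if (prop["domain"] == stem and prop["grammar"] == grammar
                and TYPES.get(filler[0]) == prop["range"]):
            return (stem, {**props, prop["name"]: filler})
    raise ValueError(f"no property links {filler[0]!r} to {stem!r}")


def render(classdef):
    """Print a classdef as stem(property=value, ...)."""
    stem, props = classdef
    if not props:
        return stem
    return stem + "(" + ", ".join(f"{k}={render(v)}" for k, v in props.items()) + ")"


tree = ("prep", ("mod", "negative", "regulation"), "of",
        ("mod", "interleukin-2", "biosynthesis"))
print(render(build(tree)))
# regulation(regtype=negative, regprocess=biosynthesis(makes=interleukin-2))
```

The grammatical context (modifier versus "of" phrase) selects which property applies, and the range check rejects fillers of the wrong type.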
18
Unparseable terms and multi-parse terms
Figure: unparseable and multi-parse terms for each GO ontology
(biological process, molecular function, cellular component);
single-token terms excluded from this analysis
19
Reasoning over class definitions
  • Using class definitions, we can
  • autocreate parentage for new terms
  • check for missing relationships
  • find inconsistencies between ontologies
  • generate implicit orthogonal ontologies
  • Method
  • Use native OBOL rules (via Prolog or DAG-Edit)
  • OR use an external reasoner, e.g. RACER, FaCT

20
Finding missing relationships
  • Obol is run periodically on GO to check for
    missing IS A and PART OF relationships
  • Multiple parses produce false positives
  • 223 missing relationships added to GO
  • To do: increase specificity by improving
    vocabularies and property definitions

21
Obol sample report
  • nucleolar chromatin PART OF nucleus
  • clathrin-coated vesicle HAS PART clathrin coat
  • chromoplast membrane IS A plastid membrane
  • nuclear microtubule PART OF nucleus
  • vitamin E biosynthesis IS A vitamin E metabolism
  • uracil permease activity IS A permease activity
  • chloroplast envelope IS A plastid envelope
  • negative regulation of lipid biosynthesis IS A
    negative regulation of lipid metabolism
  • ketone body metabolism IS A ketone metabolism
  • dense nuclear body IS A nuclear body
Callouts on the slide: "inverse present", "false positive!"
22
Aligning to the OBO cell ontology
Most differentiation terms align precisely; some don't.
Figure labels: muscle cell, ???, cardiac cell differentiation,
cardiac muscle cell, mesodermal cell, animal cell,
cardioblast differentiation, cardioblast
Edge label: DEVELOPS FROM
23
Deriving existing GO relationships
24
Obol as an ontology curation tool
  • Obol can be used by GO curators in a variety of
    ways
  • Behind the scenes
  • Iterative
  • GO curator receives periodic suggestion reports
  • Continuous
  • GO curator uses OBOL interactively via DAG-Edit
    plugin
  • To help the transition to a fully specified
    ontology
  • GO curators then maintain class definitions
  • Obol as a search tool?

25
Problems to address
  • Integration with curation process
  • Memory usage
  • Syntax parsing
  • chemical terms, long terms
  • Dealing with "and", "or" and "not"
  • Generating text definitions
  • Word list maintenance
  • solution: integrate with ontology maintenance
  • Ontology dependencies
  • protein and generic anatomy ontologies needed
  • Obol can be used to help generate these

26
Conclusions
  • Obol is useful for maintaining large GO-style
    ontologies
  • combination of semantic parsing with reasoning is
    powerful
  • benefits of both GO-style ontology development
    and formal reasoning

27
Acknowledgements
  • Berkeley/GO
  • John Richter
  • Brad Marshall
  • Karen Eilbeck
  • Suzanna Lewis
  • Gerry Rubin
  • Jackson Labs/GO
  • David Hill
  • Joel Richardson
  • Judith Blake

  • GO Curators
  • Midori Harris
  • Jennifer Clark
  • Amelia Ireland
  • Jane Lomax
  • Manchester
  • Chris Wroe
  • Robert Stevens
  • Phillip Lord
  • J Michael Cherry
  • Michael Ashburner
  • all the GO Consortium
28
The OBO Universe (partial)
Chemical
Phenotype
Function
Process
Protein
Component
Cell
Sequence
Anatomy
Fly-Anat
Fish-Anat
29
Detecting inconsistencies
Figure labels: X, Y, Z alongside X binding, Y binding, Z binding
30
Prolog as an ontology language
  • DATABASE OF FACTS
  • isa(carb_binding, binding).
  • isa(polysac_binding, carb_binding).
  • isa(chitin_binding, polysac_binding).
  • isa(cellulose_binding, polysac_binding).
  • INFERENCE RULES
  • isaT(X,Y) :- isa(X,Y).
  • isaT(X,Y) :- isa(X,Z), isaT(Z,Y).

?- isaT(chitin_binding, binding).
YES
?- isaT(X, polysac_binding).
X = chitin_binding
X = cellulose_binding
?- isaT(chitin_binding, cellulose_binding).
NO
?- isaT(X,Y).
returns all paths
31
Prolog internal representation
class(regulation <process>
  qualifier=class(negative <general>)
  regulates=class(contraction <process>
    affects_cell_type=class(muscle <anatomical>
      has_type=class(smooth <general>))))
Figure (DAG) nodes: regulation, contraction, muscle, smooth, negative
32
Prolog Grammar Implementation
  • Prolog: the classic logic programming language
  • High-level declarative language, natural choice
    for ontologies built in a database
  • Definite Clause Grammars (DCGs) are part of the
    language; DCGs allow passing data up the parse
    tree
  • XSB Prolog
  • Uses tabling (more efficient, less
    re-calculation)
  • Tabling + DCGs = chart parsing (Earley's
    algorithm)
  • This means we can have left-recursive grammars

33
A Formal Grammar for OBO terms
  • All(?) GO/OBO terms are NOUN-PHRASES (exception:
    phenotypes?)
  • A NOUN-PHRASE is (recursively) made from
  • a NOUN (includes inflected verbs, e.g. binding)
  • an ADJECTIVE followed by a NOUN-PHRASE, e.g. inner
    membrane
  • a NOUN-PHRASE preceded by a NOUN-PHRASE acting
    as ADJECTIVE, e.g. clathrin coat
  • a NOUN-PHRASE then PREPOSITION then NOUN-PHRASE,
    e.g. regulation of transcription
  • an (optional) NOUN-PHRASE then a RELATIONAL
    ADJECTIVE then a NOUN-PHRASE, e.g. clathrin-coated
    vesicle
  • Precedence rules are also required to prune parse
    forest
  • Simple but effective

34
Sentence -> Subject Verb Object
Subject -> Article Noun
Object -> Article Noun
Article -> a | the
Verb -> ate | chased
Noun -> cat | banana | mouse
Parse tree for a simple sentence: "the cat ate
the banana"
GENERATING: Start at top ('sentence') and apply
rules until all symbols are terminal
PARSING: Start at bottom (a sequence of
terminals) and apply rules, combining symbols if
necessary
  • A formal grammar is a set of production rules
    operating over terminal symbols (eg words) and
    non-terminal symbols (eg word/phrase categories)
  • The rules determine how sequences of symbols can
    be transformed, making a parse tree
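
The generating direction can be sketched in a few lines. The grammar is the toy sentence grammar from this slide; the Python encoding, the `GRAMMAR` dictionary and the `expand` helper are assumptions made for the illustration, and Obol itself uses Prolog DCGs.

```python
from itertools import product

# The toy grammar: non-terminal -> list of alternative right-hand sides.
GRAMMAR = {
    "Sentence": [["Subject", "Verb", "Object"]],
    "Subject": [["Article", "Noun"]],
    "Object": [["Article", "Noun"]],
    "Article": [["a"], ["the"]],
    "Verb": [["ate"], ["chased"]],
    "Noun": [["cat"], ["banana"], ["mouse"]],
}


def expand(symbol):
    """Yield every list of terminal words derivable from `symbol`."""
    if symbol not in GRAMMAR:      # terminal symbol: a word
        yield [symbol]
        return
    for rhs in GRAMMAR[symbol]:
        # Combine every expansion of each right-hand-side symbol.
        for parts in product(*(list(expand(s)) for s in rhs)):
            yield [word for part in parts for word in part]


sentences = {" ".join(words) for words in expand("Sentence")}
print(len(sentences))                          # 72
print("the cat ate the banana" in sentences)   # True
```

The grammar is purely syntactic, so "the banana ate the cat" is generated as well; checking membership in `sentences` is a crude stand-in for parsing.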

35
Parse tree for "negative regulation of smooth muscle contraction"
Rules applied: NP -> NP PP, PP -> P NP, NP -> NP NP, NP -> ADJ NOUN
Leaf categories: ADJ NOUN P ADJ NOUN NOUN
Thick lines indicate stem terms; the parse tree
shows the syntactic structure.
Recurse down the tree, applying grammatical context
rules to get property fillers.

Class Definition (shown as DAG)
Nodes: regulation, contraction, muscle, smooth, negative
Classes shown: negative regulation of smooth muscle contraction,
smooth muscle contraction, smooth muscle
The classdef shows the logical structure.
The DAG definition is minimal: non-definitional
relationships to other terms are not shown.
We need a new OBO format (or OWL) to
represent cross-products (aka intersections /
complete defs).
36
COPII-coated vesicle membrane
class(membrane <component> "COPII-coated vesicle membrane"
  part_of=class(vesicle <component> "COPII-coated vesicle"
    has_part=class(coat <component> "COPII coat"
      made_from=class(COPII <complex> "COPII"))))
The class/term name is shown in quotes; these can be
derived by reversing the transformation.
The above classdef is consistent with what is in
the GO cellular_component ontology.
It requires use of inverse properties (has_part vs
part_of), supported in the new OBO format.
37
Outline
  • Motivation: combinatorial issues with composite
    terms
  • Approaches: annotation-time term composition vs
    tools for maintenance of large DAGs
  • The OBOL System
  • Term decomposition using grammars
  • Generating computable logical class definitions
  • Rules and reasoning over class definitions
  • Initial Results
  • Strategies for using OBOL within GO/OBO

38
Manual maintenance of GO
  • GO is 3 DAGs of over 16k terms
  • Large DAGs of terms are hard to maintain
  • cross-products produce combinatorial explosions
    and highly connected sub-graphs
  • GO terms include OBO terms
  • e.g. oxygen binding, wing development
  • Zipf's Law
  • Many terms not yet used in annotation (Ogren,
    pers. comm.)

39
Example combinatorial explosion
Figure: graph over the terms below; the legend distinguishes
implicit terms from actual GO terms (implicit anatomical
terms not shown).
  • cell motility
  • regulation
  • negative regulation
  • contraction
  • regulation of contraction
  • negative regulation of contraction
  • muscle contraction
  • regulation of muscle contraction
  • negative regulation of muscle contraction
  • smooth muscle contraction
  • regulation of smooth muscle contraction
  • negative regulation of smooth muscle contraction
40
One Approach: Properties
  • One extreme solution is to remove composite terms
    from ontology altogether
  • Generate anonymous composite terms at
    annotation-time via property/slot restrictions to
    atomic terms
  • binding affects(interleukin-18)
  • contraction affects(muscle type(smooth))
  • These bindings constitute a class definition
  • But it is still necessary to make statements
    about composite terms in the ontology
  • macrophage activation is_a immune cell activity
  • fibrinolysis is_a negative regulation of blood
    coagulation

41
Another approach: Computationally aided ontology
maintenance
  • GO terms exhibit regularity in their syntactic
    structure
  • Substring relationships highly correlated with
    actual relationships
  • regulation of smooth muscle contraction
  • smooth muscle contraction
  • muscle contraction
  • contraction

Ogren PV, Cohen KB, Acquaah-Mensah GK, Eberlein J,
Hunter L. 2004. The compositional structure of
Gene Ontology terms. Pac Symp Biocomput 9:
214-215.
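
The substring observation can be checked mechanically with a small sketch. This is a Python illustration only; the function name `substring_pairs` and the word-level matching are assumptions, and the pairs it finds are merely candidates, since substring containment does not say which relationship type (if any) holds.

```python
TERMS = [
    "regulation of smooth muscle contraction",
    "smooth muscle contraction",
    "muscle contraction",
    "contraction",
]


def substring_pairs(terms):
    """Return (longer, shorter) pairs where shorter is a contiguous word run of longer."""
    pairs = []
    for longer in terms:
        long_words = longer.split()
        for shorter in terms:
            short_words = shorter.split()
            n = len(short_words)
            if n < len(long_words) and any(
                long_words[i:i + n] == short_words
                for i in range(len(long_words) - n + 1)
            ):
                pairs.append((longer, shorter))
    return pairs


for longer, shorter in substring_pairs(TERMS):
    print(f"{longer!r} contains {shorter!r}")
```

Matching on whole words rather than characters avoids accidental hits inside longer words.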
42
OBOL Syntax and Semantics
  • What about using the term syntax to get at the
    meaning of the term?

GO/OBO Term:
interleukin binding
Class Definition (may involve relationships to
other OBO terms):
binding affects(interleukin)
Inferences:
interleukin binding is_a cytokine binding,
inferred from interleukin is_a cytokine
43
Inference of intermediate terms and IS_As
Example rule:
FORALL classdef pairs: IFF the stem-class is the
same AND all the property-values in the
restriction-list are identical
EXCEPT for one property, in
which the property-values are linked by an isa,
THEN the classdefs are linked by an isa

class(regulation process_regulated=R qual=Q)
  is_a class(regulation process_regulated=R' qual=Q)
  <=> R is_a R'

class(C P1=V1 P2=V2 .. Px=Vx .. Pn=Vn)
  is_a class(C P1=V1 P2=V2 .. Px=Vx' .. Pn=Vn)
  <=> Vx is_a Vx'
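
A minimal sketch of the rule above, in Python rather than Obol's Prolog. Classdefs are represented as (stem, {property: value}) with atomic values; the `ISA` facts and the function names are assumptions, and only the "if" direction of the rule is implemented. The asserted is_a links are assumed to be acyclic.

```python
# Illustrative asserted is_a links as (child, parent) pairs.
ISA = {("interleukin-2", "cytokine"), ("cytokine", "protein")}


def is_a(child, parent, isa=ISA):
    """Transitive is_a over the asserted links (assumes no cycles)."""
    if (child, parent) in isa:
        return True
    return any(c == child and is_a(p, parent, isa) for c, p in isa)


def classdef_is_a(a, b, isa=ISA):
    """True if classdef a is_a classdef b under the rule on this slide.

    Same stem, same properties, all values identical except exactly one,
    whose values are linked by is_a.
    """
    stem_a, props_a = a
    stem_b, props_b = b
    if stem_a != stem_b or props_a.keys() != props_b.keys():
        return False
    differing = [k for k in props_a if props_a[k] != props_b[k]]
    if len(differing) != 1:
        return False
    key = differing[0]
    return is_a(props_a[key], props_b[key], isa)


il2 = ("regulation", {"process_regulated": "interleukin-2", "qual": "negative"})
cyt = ("regulation", {"process_regulated": "cytokine", "qual": "negative"})
print(classdef_is_a(il2, cyt))  # True
print(classdef_is_a(cyt, il2))  # False
```

Nested classdefs as property values would need the same comparison applied recursively; the sketch keeps values atomic for brevity.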