Title: Principles for Building Biomedical Ontologies
1Principles for Building Biomedical Ontologies
- Suzanna Lewis
- National Center for Biomedical Ontology
- 22 October 2005
- Advanced Bioinformatics, Cold Spring Harbor
2National Center for Biomedical Ontology http://bioontology.org/
- Mark Musen: PI, Core 1 computer science (SMI)
- Suzanna Lewis: Co-PI, Core 2 bioinformatics (BiKR GO)
- Barry Smith: Core 6 Outreach and training (ECOR)
- Sima Misra: Associate Program Director
- Daniel Rubin: Program Director
- Michael Ashburner: Core 3 Phenotype Project (Cambridge FlyBase and GO)
- Monte Westerfield: Core 3 Phenotype Project (UOregon PI of ZFIN)
- Ida Sim: Core 3 HIV clinical trials Project (UCSF)
3BiKRs
- Sima Misra
- Shu Shengqiang
- Christopher J. Mungall
- Nomi Harris
- John Day-Richter
- Karen Eilbeck
- Mark Gibson
4Outline for the Morning
- A definition of ontology
- Four sessions
- Organizational Challenges
- Principles for Ontology Construction
- Case Studies from the GO
- Case Studies for group discussion.
5My newbie questions
What I've heard
- Organism, environment, data quality and
attribution
- Where is the data generated?
- TIGR, Sanger, JGI, and coming soon to a 954 near
you!
- Still an issue. Low threshold of effort relative
to benefits of complying
- Data: it is accumulating on disks across the world and we'd like to be able to locate and use it
The hardest part: Sharing (semantics)
6Ontologies help with decision making
Where should I eat?
A handy ontology tells us what's there
7Type of cuisine
(Presumable) country of origin
Ontologies don't just organize data; they also
facilitate inference, and that creates new
knowledge, often unconsciously in the user.
8What a computer would likely infer about the
world from this helpful ontology
Flag of fresh juice
Fresh Juice is a national cuisine
Where delicatessen food hails from
Frozen Yogurt cuisine in search of a national
identity?
9Ontology is all about meaning
- Communities form (scientific) theories
- that seek to explain all of the existing evidence
- and can be used for prediction
- We make inferences and decisions based upon what
we know about (biological) reality.
10Make our meanings clear enough for a computer to
understand
- An ontology is a computable representation of this underlying (biological) reality.
- An ontology enables a computer to reason over the data in (some of) the ways that we do, particularly to query and locate relevant data.
- A shared, common, backbone taxonomy of relevant entities, and the relationships between them, within an application domain.
- Referred to by information scientists as an 'Ontology'.
11But really
- What is an Ontology?
- From Aristotle to Artificial Intelligence
- It is a formalism of what exists
- Follows formal rules for creating definitions originally laid down by Aristotle.
- A definition is the specification of the essence (nature, invariant structure) shared by all the members of a class or natural kind.
12The Aristotelian Methodology
- Topmost nodes are the undefinable primitives.
- The definition of a class lower down in the hierarchy is provided by specifying the parent of the class together with the relevant differentia.
- The differentia tells us what marks out instances of the defined class within the wider parent class, as in:
- Plasma membrane
- is a cell part (immediate parent)
- that surrounds the cytoplasm (differentia)
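The genus-plus-differentia pattern above can be sketched in Python. This is only an illustration under stated assumptions: the `Definition` class and its `render` method are hypothetical names, not part of GO or any ontology tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Definition:
    """An Aristotelian definition: immediate parent (genus) plus differentia."""
    term: str
    genus: str        # the immediate parent class
    differentia: str  # what marks out instances within the parent class

    def render(self) -> str:
        # Assemble the text in "<term> is a <genus> that <differentia>" form.
        return f"{self.term} is a {self.genus} that {self.differentia}"


# The slide's example, encoded with the hypothetical class above.
plasma_membrane = Definition(
    term="plasma membrane",
    genus="cell part",
    differentia="surrounds the cytoplasm",
)
print(plasma_membrane.render())
```

Keeping genus and differentia as separate fields is what lets a definition be checked against the class's actual parent in the hierarchy.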
13classes
Physical object (substance)
mammal
frog
leaf class
all members of the class frog share a froggy
nature
14Anatomical structures
Lung
Heart
Thorax
Cell
Cornelius Rosse
15Content of FMA
Challenge: Duplicate graphical model in symbolic model
Universals or classes: Kinds of anatomical entities
Adapted from Bloom & Fawcett, Textbook of Histology, 12th ed., Chapman & Hall, 1994
16Content of FMA
171. Organizational Challenges
- http://obo.sourceforge.net
18So you want an ontology
- What do you have to do to make/get/use/steal/beg
one?
19[Flowchart. Steps: Why, Survey. Questions with yes / no branches: Domain covered?, Public?, Community?, Active?, Applied?. Outcomes: Salvage, Develop, Improve, Collaborate & Learn.]
20What you must do
- Justify exactly why there is a need
- Scope it very, very tightly
- Communicate with people
21The decisions you must make
- What domain does it cover?
- Is it privately held?
- Is it active?
- Is it applied?
22[Flowchart from slide 19 repeated, with the final box reading "Collaborate & Learn (Listen to Barry)".]
23Due diligence: background research
- Step 1: Learn what is out there
- The most comprehensive list is on the OBO site: http://obo.sourceforge.net
- Assess ontologies critically and realistically.
- Make contact
24[Flowchart from slide 19 repeated.]
25Ontologies must be shared
- Proprietary ontologies
- Belief that ownership of the terminology gives the owners a competitive edge
- For example, Incyte or Monsanto in the past, SNOMED for non-US.
- Data cannot be shared if the ontologies describing the data are not shared.
- Don't reinvent. Use the power of combination and collaboration
26[Flowchart from slide 19 repeated.]
27Pragmatic assessment of an ontology
- Is there access to help, e.g. help-me_at_weird.ontology.net?
- Does a warm body answer help mail within a reasonable time, say 2 working days?
28[Flowchart from slide 19 repeated.]
29Use it to improve it
- Every ontology improves when it is applied to actual data
- It improves even more when these data are used to answer questions
- There will be fewer problems in the ontology and more commitment to fixing remaining problems when important research data that scientists depend upon are involved
- Be very wary of ontologies that have never been applied
30Work with that community
- To improve (if you found one)
- To develop (if you did not)
- Getting it right
- It is impossible to get it right the 1st (or 2nd, or 3rd, ...) time.
- What we know about reality is continually growing
31Implication: prepare for change
- Establish a mechanism for change.
- Use CVS or Subversion.
- Changes must be reviewed by experts
- Unique Identifiers
- Versions
- Archives
32Ontology development is hard
- Have a stake in seeing it work.
- Have broad, detailed domain knowledge.
- Will engage in vigorous debate without engaging egos.
- Will do concrete work and attend frequent working sessions (quarterly), phone conferences (weekly), e-mail correspondence (daily).
332. Principles for Ontology Construction
34Why do we need rules for good ontology?
- Ontologies must be intelligible
- to humans (for annotation) and
- to machines (for reasoning and error-checking)
- Unintuitive rules for classification lead to entry errors (problematic links)
- Facilitate training of curators
- Overcome obstacles to alignment with other ontology and terminology systems
- Enhance harvesting of content through automatic reasoning systems
- Following basic rules makes more useful ontologies
35Aristotle's categories
This is Aristotle's list of types of predication,
that is, the different ways in which things can
be said to be. He identifies 10 mutually
exclusive categories.
36SNOMED-CT Top Level
- Substance
- Body Structure
- Specimen
- Context-Dependent Categories
- Attribute
- Finding
- Staging and Scales
- Organism
- Physical Object
- Events
- Environments and Geographic Locations
- Qualifier Value
- Special Concept
- Pharmaceutical and Biological Products
- Social Context
- Disease
- Procedure
- Physical Force
37Examples of Rules
- Don't confuse instances with universals
- Your navel (instance) is not the abstract representation of all navels
- Your microarray result is not the abstract representation of all microarray results
- The meaning of an ontology should not change when the programming language changes
38First Rule: Univocity
- Terms (including those describing relations) should have the same meanings on every occasion of use.
- In other words, they should refer to the same kinds of instances in reality
39Example of univocity problem in case of part_of
relation
- (Old) Gene Ontology
- part_of = "may be part of"
- flagellum part_of cell
- part_of = "is at times part of"
- replication fork part_of the nucleoplasm
- part_of = "is included as a sub-list in"
40Second Rule: Positivity
- Complements of classes are not themselves classes.
- Terms such as 'non-mammal', 'non-frog', or 'non-membrane' do not designate genuine classes.
41Third Rule: Objectivity
- Which classes exist is not a function of our biological knowledge.
- Terms such as 'unknown' or 'unclassified' do not designate biological natural kinds.
42Fourth Rule: Single Inheritance
- No class in a classificatory hierarchy should have more than one is_a parent on the immediate higher level
- I.e. no diamonds
43Following the single inheritance rule
- The position of a term within the hierarchy enriches its own definition by incorporating automatically the definitions of all the terms above it.
- The entire information content of the term hierarchy can be translated very cleanly into a computer representation
44Problems with multiple inheritance
- [Diagram: class A has two parents, B (via is_a1) and C (via is_a2)]
- is_a is no longer univocal
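A check for the single inheritance rule can be sketched in a few lines of Python. This is a minimal illustration, assuming the hierarchy is available as a list of (child, parent) is_a pairs; the function name `multiple_inheritance` and the sample edges are hypothetical.

```python
from collections import defaultdict


def multiple_inheritance(is_a_edges):
    """Return {child: sorted parents} for every class with more than one is_a parent."""
    parents = defaultdict(set)
    for child, parent in is_a_edges:
        parents[child].add(parent)
    return {child: sorted(ps) for child, ps in parents.items() if len(ps) > 1}


# The diamond from the slide: A has two is_a parents, B and C.
edges = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")]
violations = multiple_inheritance(edges)
print(violations)
```

Each reported class is a place where is_a may be carrying two different meanings and deserves a curator's attention.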
45Fifth Rule: Clarity of Text Definitions
- The terms used in a definition should be simpler (more intelligible) than the term to be defined
- otherwise the definition provides no assistance to human understanding
- Machines can cope with the full formal representation (they don't need the text)
46Sixth Rule: Basis in Reality
- When building or maintaining an ontology, always think carefully about how classes (types, kinds, species) relate to instances in reality
- Axioms governing instances
- Every class has at least one instance (exceptions will occur at top levels)
- Each child class has a smaller collection of instances than its parent class
47Axiom: Every parent class has at least two children
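The instance axioms (every class has an instance, each child has fewer instances than its parent, every parent has at least two children) can be checked mechanically. The following Python sketch assumes an acyclic hierarchy given as a parent-to-children mapping plus a mapping from each class to its directly asserted instances; all names and the sample data are hypothetical.

```python
def check_axioms(children, instances):
    """Return a list of human-readable axiom violations.

    children:  {parent class: [child classes]} (assumed acyclic)
    instances: {class: set of directly asserted instance names}
    """
    def extension(cls):
        # All instances of a class: its own plus those of every descendant.
        result = set(instances.get(cls, set()))
        for child in children.get(cls, []):
            result |= extension(child)
        return result

    problems = []
    classes = set(children) | set(instances)
    classes |= {kid for kids in children.values() for kid in kids}
    for cls in sorted(classes):
        if not extension(cls):
            problems.append(f"{cls}: no instances")
    for parent, kids in sorted(children.items()):
        if len(kids) < 2:
            problems.append(f"{parent}: fewer than two children")
        for kid in kids:
            # "<" on sets is the proper-subset test.
            if not extension(kid) < extension(parent):
                problems.append(f"{kid}: not smaller than parent {parent}")
    return problems


children = {"mammal": ["dog", "cat"], "dog": ["poodle"]}
instances = {"dog": {"rex"}, "cat": {"tom"}, "poodle": {"fifi"}, "unicorn": set()}
report = check_axioms(children, instances)
print(report)
```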
48The reason that rules are important
Interoperability
- Ontologies should work together
- Avoid redundancy in ontology building
- Support reuse
- Ontologies should be capable of being used by
other ontologies (cumulation)
49The problem of ontology re-use
- SNOMED
- MeSH
- UMLS
- NCIT
- HL7-RIM
- None of these have clearly defined relations
- Still remain too much at the level of TERMINOLOGY
- Not based on a common set of rules
- Not based on a common set of relations
50An example of unclear relationship use
- A is_a B
- A is more specific in meaning than B
- HL7-RIM
- Individual Allele is_a Act of Observation
- cancer documentation is_a cancer
- disease prevention is_a disease
51How to define A is_a B
- A is_a B =def.
- A and B are names of universals (natural kinds, types) in reality
- all instances of A are as a matter of biological science also instances of B
52Benefits of well-defined relationships
- If the relations in an ontology are well-defined, then reasoning can cascade from one relational assertion (A R1 B) to the next (B R2 C). Relations used in ontologies thus far have not been well defined in this sense.
- 'Find all DNA binding proteins' should also find all transcription factor proteins because
- Transcription factor is_a DNA binding protein
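The DNA binding protein query can be sketched as follows. This is an illustration only, assuming a single-inheritance is_a hierarchy stored as a child-to-parent mapping; the protein names and function names are hypothetical.

```python
def descendants(term, is_a):
    """Return every class that is_a `term`, directly or transitively.

    is_a: {child: parent}, a single-inheritance is_a hierarchy.
    """
    found = set()
    changed = True
    while changed:
        changed = False
        for child, parent in is_a.items():
            if (parent == term or parent in found) and child not in found:
                found.add(child)
                changed = True
    return found


def query(term, is_a, annotations):
    """Find proteins annotated to `term` or to any of its subclasses."""
    wanted = {term} | descendants(term, is_a)
    return sorted(p for p, cls in annotations.items() if cls in wanted)


IS_A = {"transcription factor": "DNA binding protein"}
ANNOTATIONS = {  # hypothetical annotation data
    "protein_1": "transcription factor",
    "protein_2": "DNA binding protein",
    "protein_3": "kinase",
}
print(query("DNA binding protein", IS_A, ANNOTATIONS))
```

The query for the parent class picks up proteins annotated to the child only because is_a is defined so that every instance of the child is an instance of the parent.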
53Biomedical data integration / interoperability
- Will never be achieved through integration of meanings or concepts
- The problem: different user communities use different concepts
- What is really needed is a well-defined, commonly used set of relationships
54Seventh Rule: Distinguish Universals and Instances
- A good ontology must distinguish clearly between
- universals (types, kinds, classes) and
- instances (tokens, individuals, particulars)
55Why distinguish classes from instances?
- What holds on the level of instances may not hold on the level of universals
- For example, my definition of an adjacent_to relation requires that it work in either direction
- (This particular) nucleus adjacent_to (this particular) cytoplasm
- Always true
- Cytoplasm adjacent_to nucleus
- Not always true
56Using relations
- Between classes
- is_a, part_of, ...
- Between an instance and a class
- this explosion instance_of the class explosion
- Between instances
- Mary's heart part_of Mary
- Relations must be defined to always work
57Defining the part_of relation can be a problem
- part_of as a relation between classes versus part_of as a relation between instances
- nucleus part_of cell (classes)
- your heart part_of you (instances)
- testis part_of human being ?
- heart part_of human being ?
- human being has_part human testis ?
58Similar considerations are required to clearly
define nearly all relations
- A causes B
- A is_located in B
- A is_adjacent_to B
- A derives_from B
- Zygote derives_from ovum, sperm
- A transformation_of B
- Adult transformation_of child
59The Rules
- Univocity: Terms should have the same meanings on every occasion of use
- Positivity: Terms such as 'non-mammal' or 'non-membrane' do not designate genuine classes.
- Objectivity: Terms such as 'unknown' or 'unclassified' or 'unlocalized' do not designate biological natural kinds.
- Single Inheritance: No class in a classification hierarchy should have more than one is_a parent on the immediate higher level
- Intelligibility of Definitions: The terms used in a definition should be simpler (more intelligible) than the term to be defined
- Basis in Reality: When building or maintaining an ontology, always think carefully about how classes relate to instances in reality
- Distinguish Classes and Instances
60Some rules are Rules of Thumb
- The world is full of difficult trade-offs
- The benefits of formal (logical and ontological) rigor need to be balanced
- Against the constraints of computer tractability,
- Against the needs of biomedical practitioners.
- BUT do the very best you can!
613. Case Studies from the GO
- http://www.geneontology.org
62How has GO dealt with some specific aspects of
ontology development?
- Univocity
- Positivity
- Objectivity
- Definitions
- Formal definitions
- Written definitions
- Ontology Re-use (Alignment)
63The Challenge of Univocity: People call the same thing by different names
Taction
Tactile sense
Tactition
?
64Univocity: GO uses one term and many characterized synonyms
Taction
Tactile sense
Tactition
perception of touch (GO:0050975)
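One way to picture "one term, many synonyms" is a lookup table in which every label resolves to a single identifier. The Python sketch below is an assumption-laden miniature, not GO's actual storage format; `TERMS`, `build_index`, and `resolve` are hypothetical names.

```python
# Hypothetical miniature term record: one primary name, many synonyms.
TERMS = {
    "GO:0050975": {
        "name": "perception of touch",
        "synonyms": ["taction", "tactile sense", "tactition"],
    },
}


def build_index(terms):
    """Map every name and synonym (lower-cased) to its single term ID."""
    index = {}
    for term_id, record in terms.items():
        for label in [record["name"], *record["synonyms"]]:
            index[label.lower()] = term_id
    return index


INDEX = build_index(TERMS)


def resolve(label):
    """Return the term ID for a label, or None if the label is unknown."""
    return INDEX.get(label.strip().lower())


print(resolve("Taction"), resolve("Tactile sense"), resolve("Tactition"))
```

Annotators can keep using their preferred word while the data is stored against one identifier.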
65The Challenge of Univocity: People use the same words to describe different things
66Bud initiation? How is a computer to know?
67Univocity: GO adds 'sensu' descriptors to discriminate among organisms
68The Challenge of Positivity
Some organelles are membrane-bound. A centrosome
is not a membrane bound organelle, but it still
may be considered an organelle.
69The Challenge of Positivity: Sometimes absence is a distinction in a biologist's mind
non-membrane-bound organelle (GO:0043228)
membrane-bound organelle (GO:0043227)
70Positivity
- Note the logical difference between
- non-membrane-bound organelle and
- not a membrane-bound organelle
- The latter includes everything that is not a
membrane bound organelle!
71The Challenge of Objectivity Database users want
to know if we dont know anything (Exhaustiveness
with respect to knowledge)
We dont know anything about the ligand that
binds this type of GPCR
We dont know anything about a gene product
with respect to these
72Objectivity
- How can we use GO to annotate gene products when we know that we don't have any information about them?
- Currently GO has terms in each ontology to describe 'unknown'
- An alternative might be to annotate genes to root nodes and use an evidence code to describe that we have no data.
- Similar strategies could be used for things like receptors where the ligand is unknown.
73GPCRs with unknown ligands
We could annotate to this
74GO Definitions
A definition written by a biologist: necessary and sufficient conditions; written definition (not computable)
Graph structure: necessary conditions; formal (computable)
75Relationships and definitions
- Important considerations
- Placement in the graph- selecting parents
- Appropriate relationships to different parents
- True path violation
76True path violation: What is it?
"...the path from a child term all the way up to its top-level parent(s) must always be true."
- chromosome part_of nucleus
- mitochondrial chromosome is_a chromosome
- Following the path upward implies that a mitochondrial chromosome is part_of the nucleus, which is false
77True path violation: What is it?
- nuclear chromosome is_a chromosome
- mitochondrial chromosome is_a chromosome
- nuclear chromosome part_of nucleus
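The true path rule lends itself to a mechanical check: walk up the is_a chain from a term and collect every part_of parent met on the way, since each of these is implied for the term itself. The Python sketch below assumes an acyclic, single-inheritance is_a mapping and does not follow part_of transitively; the two graphs are our reading of the slide diagrams, and the function name is hypothetical.

```python
def inferred_part_of(term, is_a, part_of):
    """Collect everything `term` is implied to be part_of.

    is_a:    {child: parent} (assumed acyclic, single inheritance)
    part_of: {part: whole}
    """
    wholes = []
    current = term
    while current is not None:
        if current in part_of:
            wholes.append(part_of[current])
        current = is_a.get(current)  # None once the top of the chain is reached
    return wholes


# Before: chromosome itself is part_of nucleus.
old_is_a = {"mitochondrial chromosome": "chromosome"}
old_part_of = {"chromosome": "nucleus"}

# After: only nuclear chromosome is part_of nucleus.
new_is_a = {
    "mitochondrial chromosome": "chromosome",
    "nuclear chromosome": "chromosome",
}
new_part_of = {"nuclear chromosome": "nucleus"}

print(inferred_part_of("mitochondrial chromosome", old_is_a, old_part_of))
print(inferred_part_of("mitochondrial chromosome", new_is_a, new_part_of))
```

A curator can scan the inferred list for statements that are biologically false; the code only surfaces them, it cannot judge them.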
78The Importance of synonyms: is tRNA a function?
Molecular_function
Triplet codon amino acid adaptor activity
GO Definition: Mediates the insertion of an amino
acid at the correct point in the sequence of a
nascent polypeptide chain during protein
synthesis.
Synonym: tRNA
79Ontology integration: One of the current goals of GO is integration
References to Cell Types in GO with Cell Types in the Cell Ontology
- cone cell fate commitment
- keratinocyte differentiation
- adipocyte differentiation
- dendritic cell activation
- garland cell differentiation
- heterocyst cell differentiation
80We can integrate the GO with other ontologies
- Chemical ontologies
- 3,4-dihydroxy-2-butanone-4-phosphate synthase activity
- Anatomy ontologies
- metanephros development
- GO itself
- mitochondrial inner membrane peptidase activity
- Nota bene: some time and effort will be required
81Building Ontology
Improve
Collaborate and Learn
82Applied Ontology: a summary
- Dedicated editors
- Practice good ontological hygiene
- Engage the community
- Reward compliance and get the ontology into use
- Plan for change over time
- KISS: Concentrate on what you can definitely agree upon, the steps you can take with certainty.
834. Case Studies for group discussion
84mitosis and meiosis
- It's been a full lunar cycle since we last talked
about this on the mailing list, and I would like
to draw everyone's attention once again to the
exciting topics of chromosome segregation,
nuclear division and cell division. The basic
problem is the multiplicity of meanings attached
to 'mitosis'. The word is used in the literature
and colloquially to represent everything from
chromosome segregation up to a full round of
nuclear and cell division and there is no
consensus on how to define it in scientific or
general dictionaries (check www.onelook.com for
proof). To compound the problem, the only process
common to all species which undergo 'mitosis' is
chromosome segregation; not all species undergo
nuclear division or cell division during the
processes described in the literature as
'mitosis'. In the ontologies, we currently have
'mitosis' defined as chromosome segregation and
nuclear division. This is therefore wrong for
those species in which there is no nuclear
division accompanying chromosome segregation. How
are we going to define mitosis?
85- Events of the mitotic cell cycle that need to be represented
- mitotic chromosome segregation
- mitotic nuclear division
- mitotic cell division
- Only component common to all these is mitotic chromosome segregation.
- Structure must be flexible enough to accommodate any of the flavors of 'mitosis', no matter what the species and no matter whether the annotator has read the definition or not.
87Backing up assertions
- QUESTION: What evidence code is appropriate to use for statements of common knowledge?
88- The current documentation states that TAS may be used as the evidence code for statements of common knowledge.
- For example, let's say you have a paper that says that Protein X is an xxxxx, with a direct assay for activity, so you can use IDA for this function term. Then it also makes a mutation in the gene for Protein X and shows that it is involved in process yyyy, so you can use IMP for the process term. But the paper does not have any direct evidence about the localization of Protein X. However, everyone knows that process yyyy occurs in the cytoplasm, so you can annotate Protein X to the component term cytoplasm (GO:0005737) by TAS using a general reference like Biochemistry by Lubert Stryer.
89- There is not really a traceable statement in Stryer providing evidence that process yyyy occurs in this location in yeast.
- SGD feels that it is better to use the newer evidence code IC for these 'common knowledge' types of annotations. Thus, if an SGD curator felt that it was reasonable to make the annotation cytoplasm based on the knowledge that Protein X has the process annotation yyyy, then the curator could assign the component term cytoplasm (GO:0005737) using IC and the GO ID of the process term yyyy.
90- Many of these 'common knowledge' types of statements are not well based in actual experiments conducted on the organism of interest. Early biochemists would often perform experiments with materials that were easy to obtain, e.g. calf thymus, and assume that this accurately represented the situation for another organism, e.g. human. This may or may not be the case.
91What is the most appropriate GO term for
annotating a response to methylmercury?
- "Response to mercury ion" doesn't seem quite
right, as it specifically states that the
response is "as a result of exposure to mercuric
ions (Hg2+)", but the more general-sounding
"response to mercury" is a synonym of it. In the
publication I am working on, they exposed
zebrafish to methylmercury and documented the
resulting changes in gene expression.
92"Response to mercury ion"
- Definition: A change in state or activity of the organism (in terms of movement, secretion, enzyme production, gene expression, etc.) as a result of exposure to mercuric ions (Hg2+).
- Synonyms: response to mercuric, response to mercury
93Homeobox > DNA binding?
- http://www.geneontology.org/email-annotation/annotation-arc/annotation-2005/0208.html
94- Bloggers and other online groups (e.g. del.icio.us, Flickr online photo archive, Technorati) have been self-categorizing or 'tagging' web sites and their content using user-defined words and phrases and not an expertly curated vocabulary or ontology. The end result is that a vast amount of content has been indexed using a rich vocabulary of tags (to date, Technorati has over 1.2 billion links tagged with 1.2 million tags).
- Whilst this certainly lacks the formal consistency that would be obtained with curated annotation against a standard vocabulary, the quantity of content being categorized far exceeds what could be done by a group of annotators and perhaps is richer because the tags are defined by the users and creators of that content, not by a third party interpreting the material after the fact.
- Given the ever increasing quantity of scientific data, the proliferation of online publishing, etc., could scientists tagging their own data with their own terms be the way to go?
95- How can you recruit and train people, in both
logic and biology, given that without a
sufficient number of competent personnel the
ontology cannot be maintained?
96Thanks to NIH and HHMI for funding and support
- And to my fantastic colleagues (whose slides these are)
- MICHAEL ASHBURNER, BARRY SMITH, DAVID HILL, CORNELIUS ROSSE, CHRIS MUNGALL
97P.S. Graphical User Interfaces
98Common pitfalls
- Don't confuse instances with artifacts of your
database representation...
99Instances are not included!
- It is the universals that are important
- though instances must be taken into account.
- Please keep this in mind; it is crucial to understanding the tutorial
- Simon is an instance of the universal (class) human
100Concept
- Concepts are in your head and will change as our understanding changes
- Universals exist and have an objective reality
101Ontologies as Controlled Vocabularies
- expressing discoveries in the life sciences in a uniform way
- providing a uniform framework for managing annotation data deriving from different sources and with varying types and degrees of evidence
102Structured definitions contain both genus and
differentiae
Essence = Genus + Differentiae
neuron cell differentiation =
Genus: differentiation (processes whereby a relatively unspecialized cell acquires the specialized features of..)
Differentiae: acquires features of a neuron
103Key idea: To define ontological relations
- Move from associative relations between meanings to strictly defined relations between the entities themselves.
- The relations can then be used computationally in the way required.
- For example: part_of, develops_from
- Definitions will enable computation
- To define relations we must look at more than the classes.
- We need also to take account of instances and time
104is_a is pressed into service to mean a variety
of different things
- shortfalls from single inheritance are often clues to multiple is_a classification meanings
- the resulting ambiguities make it difficult for curators to reliably enter new terms (errors).
- serves as obstacle to integration with neighboring ontologies
- The success of ontology alignment depends crucially on the degree to which basic ontological relations such as is_a and part_of can be relied on as having the same meanings in the different ontologies to be aligned.
105Definitions of the all-some form
- Allow cascading inferences
- If A R1 B and B R2 C, then we know that
- Every A stands in R1 to some B,
- but we know also that, whichever B this is, it
can be plugged into the R2 relation, because R2
is defined for every B.
106Not only relations
- We can apply the same methodology to other top-level categories in ontology, e.g.
- anatomical structure
- process
- function
- regulation, inhibition, suppression, co-factor ...
- boundary, interior
- contact, separation, continuity
- tissue, membrane, sequence, cell
107To the degree that the above rules are not
satisfied, error checking and ontology alignment
will be achievable, at best, only with human
intervention and via force majeure
108What we have argued for
- A methodology which enforces clear, coherent definitions
- This promotes quality assurance
- intent is not hard-coded into software
- Meaning of relationships is defined, not inferred
- Guarantees automatic reasoning across ontologies and across data at different granularities
109The importance of relationships
- Cyclin dependent protein kinase
- Complex has a catalytic and a regulatory subunit
- How do we represent these activities (function) in the ontology?
- Do we need a new relationship type (regulates)?
Molecular_function
Catalytic activity
Enzyme regulator activity
protein kinase activity
Protein kinase regulator activity
protein Ser/Thr kinase activity
Cyclin dependent protein kinase activity
Cyclin dependent protein kinase regulator activity
110GO textual definitions: Related GO terms have similarly structured (normalized) definitions
111Alignment of the Two Ontologies will permit the
generation of consistent and complete definitions
GO
Cell type
Osteoblast differentiation: Processes whereby an
osteoprogenitor cell or a cranial neural crest
cell acquires the specialized features of an
osteoblast, a bone-forming cell which secretes
extracellular matrix.
New Definition
112Alignment of the Two Ontologies will permit the
generation of consistent and complete definitions
id: GO:0001649
name: osteoblast differentiation
synonym: osteoblast cell differentiation
genus: differentiation GO:0030154 (differentiation)
differentium: acquires_features_of CL:0000062 (osteoblast)
definition (text): Processes whereby a relatively unspecialized cell acquires the specialized features of an osteoblast, the mesodermal cell that gives rise to bone
Formal definitions with necessary and sufficient conditions, in both human readable and computer readable forms
113part_of
- part_of must be time-indexed for spatial classes
- A part_of B is defined as
- Given any instance a and any time t,
- If a is an instance of the universal A at t,
- then there is some instance b of the universal B
- such that
- a is an instance-level part_of b at t
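The definition above can be written as a single formula. This is a sketch in our own notation, which the slides do not use: inst(x, X, t) reads "x is an instance of X at t", the italic part_of is the class-level relation, and the roman part_of is the instance-level relation. Requiring b to instantiate B at the same time t is our assumption; the slide only says "some instance b of the universal B".

```latex
A \;\mathit{part\_of}\; B \;=_{\mathrm{def}}\;
\forall a\, \forall t\, \Bigl[\, \mathrm{inst}(a, A, t) \rightarrow
\exists b\, \bigl(\, \mathrm{inst}(b, B, t) \wedge \mathrm{part\_of}(a, b, t) \,\bigr) \Bigr]
```

The all-some shape (every a, some b) is what allows one assertion to be chained to the next.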
114derives_from
[Diagram: instances c1 of C1 at t1, c of C at t, and c' of C' at t, arranged along a time axis. Example: zygote derives_from ovum and sperm.]
115transformation_of
- C2 transformation_of C1 is defined as
- Given any instance c of C2
- c was at some earlier time an instance of C1
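In the same spirit, the definition of transformation_of can be sketched as a formula. The notation inst(x, X, t), for "x is an instance of X at t", and the explicit time variables t and t' are our assumptions; the slide states the condition only in words.

```latex
C_2 \;\mathit{transformation\_of}\; C_1 \;=_{\mathrm{def}}\;
\forall c\, \forall t\, \Bigl[\, \mathrm{inst}(c, C_2, t) \rightarrow
\exists t'\, \bigl(\, t' < t \wedge \mathrm{inst}(c, C_1, t') \,\bigr) \Bigr]
```

Note that the same instance c appears on both sides: the adult is the very individual that was earlier a child, which is what distinguishes transformation_of from derives_from.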
116embryological development
117tumor development
118Key
- In the following discussion
- Classes are in upper case
- A is the class
- Instances are in lower case
- a is a particular instance
119Placement in the graph
- Example: Proteasome complex