Title: Ontologies and Biomedicine
1Ontologies and Biomedicine
- What is the "right" amount of semantics?
2Ontologies and Biomedicine
- The right amount of semantics depends on what
you want to do with it
3Ontologies and Biomedicine
- Research is based on inference from what is
known, and therefore it demands rigor
4Ontologies and Biomedicine
- Without rigor, we won't know what we know, or
where to find it, or what to infer from it.
5Semantic Spectrum
- Natural language: ambiguous, highly expressive
- Computable ontology: logical and precise, less expressive
6Ad hoc tagging approach
- Let the users define words and phrases
- Forgoes the use of an expertly curated vocabulary or ontology.
- Fast and distributed approach yields a vast amount of content
- No recruitment and training of people to maintain the ontology is required.
- No recruitment and training of annotators to interpret the material is required.
8Ad hoc tagging approach
- The tagging approach places the burden of interpretation and classification on every end user
- Overall this is more costly and wasteful
- It is inappropriate in the scientific domain
- The problem is not about people communicating. It
is about computers and HCI.
9Build, apply, and use
- The ontology captures current scientific theory that seeks to explain all of the existing evidence, and it is used to draw inferences and make predictions
- Acts like a review
- Requires curators who are experts in both the science and logic
- Ontology application is the real bottleneck
- But overall is less costly and wasteful
10
- Univocity: Terms should have the same meaning on every occasion of use
- Positivity: Terms such as non-mammal or non-membrane do not designate genuine classes.
- Objectivity: Terms such as unknown or unclassified or unlocalized do not designate biological natural kinds.
- Single Inheritance: No class in a classification hierarchy should have more than one is_a parent on the immediate higher level
- Intelligible Definitions: The terms used in a definition should be simpler (more intelligible) than the term to be defined
- Reality Based: When building or maintaining an ontology, always think carefully about how classes relate to instances in reality
- Distinguish Classes and Instances: What is necessarily true for instances is not necessarily true for classes
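Some of these principles can be checked mechanically. The following is a minimal sketch, assuming the hierarchy is available as plain (child, parent) is_a pairs; the toy hierarchy, the function names, and the list of suspect words are illustrative assumptions, not part of any GO tooling.

```python
from collections import defaultdict

# Hypothetical toy hierarchy of (child, parent) is_a pairs.
# The second parent of "hemocyte" is made up to trigger the check.
IS_A_PAIRS = [
    ("neuroblast", "neuronal stem cell"),
    ("neuronal stem cell", "stem cell"),
    ("stem cell", "cell"),
    ("hemocyte", "cell"),
    ("hemocyte", "immune cell"),
]

# Words named on the slide as not designating natural kinds.
SUSPECT_WORDS = ("unknown", "unclassified", "unlocalized")


def multiple_parents(is_a_pairs):
    """Return classes that have more than one direct is_a parent."""
    parents = defaultdict(set)
    for child, parent in is_a_pairs:
        parents[child].add(parent)
    return {c: sorted(p) for c, p in parents.items() if len(p) > 1}


def suspect_terms(terms):
    """Flag negative terms (non-...) and terms built on suspect words."""
    flagged = []
    for term in terms:
        words = term.lower().split()
        if term.lower().startswith("non-") or any(w in SUSPECT_WORDS for w in words):
            flagged.append(term)
    return flagged


if __name__ == "__main__":
    print(multiple_parents(IS_A_PAIRS))
    print(suspect_terms(["non-mammal", "membrane", "unknown function"]))
```

A check like this only flags candidates for a curator to review; it does not decide whether a term is a genuine class.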
11Annotation bottleneck
- An active lab can easily generate 10-100 GB of data per month, and it is very difficult to manage data on this scale.
- Even the best analytic schemes will be for naught if we cannot find our data.
- And the data is complex
- Yet, the annotation effort required will be utterly wasted if it cannot be reliably computed upon.
13Implies numerous light ontologies
- 3-dimensions
- Protein function
- Cell type
- Tissue
- Stage
- Cellular component
- Organism
- And more
14Or it implies a single complex one
- 3-dimensions
- Protein function
- Cell type
- Tissue
- Stage
- Cellular anatomy
- Organism
- And more
- Plus all of the relations between these elements
15Practicalities
- The ontology should be robust or the annotators' time is wasted
- Research won't wait; data must be annotated at the rate at which it is generated
- Complex ontologies are much more difficult to get right than lighter ones
- Light ontologies are easier to build and maintain
- Complex ontologies can be built from lighter ones
16A successful case study
17The aims of GO
- To develop comprehensive shared vocabularies of terms describing aspects of molecular biology.
- To describe the gene products held in each contributing model organism database.
- To provide a scientific resource for access to the vocabularies, the annotations, and associated data.
- To provide a software resource to assist in curation of GO term assignments to biological objects.
18The primary strength of the GO
- The GO covers three domains of biology
- Molecular Function
- Biological Process
- Cellular Component
- These are precisely defined axes of
classification
19The breakdown of work
- Task 1: Building the ontology, a computable description of the biological world
- Task 2: Describing your gene product (annotation)
- Biological process
- Molecular function
- Cellular localization
20The early key decisions
- The vocabulary itself requires a serious and ongoing effort.
- Carefully define every concept
- Initially keep things as simple as possible and only use a minimally sufficient data representation.
- Focus initially on molecular aspects that are shared between many organisms.
21GO databases: distributed and centralized
- Support cross-database queries by having a mutual understanding of the definition and meaning of any word used to describe a gene product
- Provide database access to a common repository of annotations by submitting a summary of gene products that have been annotated
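A minimal sketch of why shared term identifiers make cross-database queries straightforward: annotations from different databases can be pooled and grouped by GO ID. The gene product names and the annotation rows below are made up for illustration; only the database names and GO IDs appear elsewhere in this deck.

```python
from collections import defaultdict

# Hypothetical annotation summaries: (database, gene product, GO ID).
ANNOTATIONS = [
    ("FlyBase", "hypothetical_gene_1", "GO:0006810"),
    ("SGD", "hypothetical_gene_2", "GO:0006810"),
    ("SGD", "hypothetical_gene_3", "GO:0003677"),
]


def products_by_term(annotations):
    """Group gene products from all databases under their shared GO ID."""
    by_term = defaultdict(list)
    for database, product, go_id in annotations:
        by_term[go_id].append((database, product))
    return dict(by_term)


if __name__ == "__main__":
    for go_id, products in products_by_term(ANNOTATIONS).items():
        print(go_id, products)
```

The grouping works only because every contributing database uses the same identifier for the same concept.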
24GODatabase.org (averages per week)
- Hits: 77,012
- Visits: 14,063
- Sites: 6,638
26Number of links to a site as reported by Google
- www.geneontology.org: 7,240
- www.godatabase.org: 33
- obo.sourceforge.net: 10
- song.sourceforge.net: 6
- genome.ucsc.edu: 3,670
- www.ncbi.nih.gov: 12,000
- www.ebi.ac.uk: 14,900
- sciencemag.org: 14,900
- www.ncbi.nlm.nih.gov: 34,500
27Most Common GOIDs accessed via AmiGO
- 72020 GO:0006810 transport
- 56862 GO:0005524 ATP binding
- 53622 GO:0019012 virion
- 47773 GO:0006955 immune response
- 46943 GO:0003677 DNA binding
- 41474 GO:0006508 proteolysis and peptidolysis
- 41126 GO:0006355 regulation of transcription, DNA-dependent
- 40427 GO:0004872 receptor activity
- 34943 GO:0005215 transporter activity
- 30890 GO:0007186 G-protein coupled receptor protein signaling pathway
- 30001 GO:0003700 transcription factor activity
- 28127 GO:0006118 electron transport
- 26636 GO:0005509 calcium ion binding
- 24007 GO:0006968 cellular defense response
- 21250 GO:0016486 peptide hormone processing
- 20440 GO:0008152 metabolism
- 19742 GO:0005515 protein binding
- 19316 GO:0007155 cell adhesion
- 18254 GO:0005198 structural molecule activity
28Taxa covered by the GO (some)
- Arabidopsis: TAIR, taxon:3702
- Caenorhabditis: WormBase, taxon:6239
- Candida albicans: CGD, taxon:5476
- Danio: ZFIN, taxon:7955
- Dictyostelium: DictyBase, taxon:5782
- Drosophila: FlyBase, taxon:7227
- Mus: MGI, taxon:10090
- Oryza sativa: Gramene, taxon:39947, Oryza sativa (japonica cultivar-group)
- Rattus: RGD, taxon:10116
- Saccharomyces: SGD, taxon:4932
- Leishmania major: GeneDB, taxon:5664
- Plasmodium falciparum: GeneDB, taxon:5833
- Schizosaccharomyces pombe: GeneDB, taxon:4896
- Trypanosoma brucei: GeneDB, taxon:185431
- Bacillus anthracis: TIGR, taxon:198094
- Coxiella burnetii: TIGR, taxon:227377
- Geobacter sulfurreducens: TIGR, taxon:243231
- Listeria monocytogenes: TIGR, taxon:265669
- Methylococcus capsulatus: TIGR, taxon:243233
- Pseudomonas syringae: TIGR, taxon:223283
- Shewanella oneidensis: TIGR, taxon:211586
- Vibrio cholerae: TIGR, taxon:686
29NIH-funded experimental research that uses the GO
- National Institute on Aging (NIA)
- National Institute of Allergy and Infectious Diseases (NIAID)
- National Cancer Institute (NCI)
- National Institute on Drug Abuse (NIDA)
- National Institute on Deafness and Other Communication Disorders (NIDCD)
- National Institute of Dental and Craniofacial Research (NIDCR)
- National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK)
- National Institute of Biomedical Imaging and Bioengineering (NIBIB)
- National Institute of Environmental Health Sciences (NIEHS)
- National Eye Institute (NEI)
- National Institute of General Medical Sciences (NIGMS)
- National Institute of Child Health and Human Development (NICHD)
- National Human Genome Research Institute (NHGRI)
- National Heart, Lung and Blood Institute (NHLBI)
- National Library of Medicine (NLM)
- National Institute of Neurological Disorders and Stroke (NINDS)
- National Center for Research Resources (NCRR)
30Other funded experimental projects that use the GO
- Public Health Service
- Walter Reed Army Medical Center
- United States Department of Agriculture
- Department of Defense
- USAID
- National Science Foundation
31A successful case study
- There are still challenges to meet
32Building upon (sharing) light, axiomatic ontologies eliminates
- Spelling mistakes or differences: oesinophil vs. eosinophil
- Differences in synonyms, names or naming conventions: Spermatazoon, sperm cell, spermatozoid, sperm
- Differences in definitions: pericardial cell develops_from mesodermal cell vs. Nothing develops_from pericardial cell
- Inconsistent structure
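One way to picture how a shared vocabulary removes spelling and synonym differences is a lookup from every accepted label to a single term identifier. This is a minimal sketch; the identifiers `TERM:0001` and `TERM:0002`, the choice of preferred names, and the function names are hypothetical and do not come from GO or CL.

```python
# Hypothetical shared vocabulary: one ID per concept, with curated synonyms.
SHARED_TERMS = {
    "TERM:0001": {
        "name": "sperm",
        "synonyms": ["sperm cell", "spermatozoon", "spermatozoid"],
    },
    "TERM:0002": {"name": "eosinophil", "synonyms": []},
}


def build_index(terms):
    """Map every accepted label (lowercased) to its term ID."""
    index = {}
    for term_id, entry in terms.items():
        for label in [entry["name"], *entry["synonyms"]]:
            index[label.lower()] = term_id
    return index


def resolve(label, index):
    """Return the term ID for a label, or None if it is not in the vocabulary."""
    return index.get(label.strip().lower())


if __name__ == "__main__":
    index = build_index(SHARED_TERMS)
    print(resolve("Sperm cell", index))   # a curated synonym resolves to one ID
    print(resolve("oesinophil", index))   # a misspelling is rejected, not stored
```

Annotators who pick terms through such a lookup cannot introduce a new spelling or a private synonym, because anything outside the vocabulary fails to resolve.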
33Inconsistent structure
- GO: hemocyte differentiation (sensu Arthropoda), lamellocyte differentiation, plasmatocyte differentiation
- CL: hemocyte, plasmocyte, lamellocyte
34Finer granularity in the GO
- GO
- immune cell activation, migration, chemotaxis
- erythrocyte differentiation is_a myeloid blood cell differentiation
- CL
- no such term "immune cell"
- no such term "myeloid blood cell"
35Coarser granularity in the GO
- GO
- neuroblast proliferation is_a cell proliferation
- CL
- neuroblast is_a neuronal stem cell is_a stem cell
is_a cell
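The difference in granularity can be bridged by following is_a links transitively. The sketch below walks the CL chain quoted above to collect every ancestor of a term; the dictionary layout and the function name are assumptions made for illustration.

```python
# is_a parents taken from the CL chain on this slide.
IS_A = {
    "neuroblast": ["neuronal stem cell"],
    "neuronal stem cell": ["stem cell"],
    "stem cell": ["cell"],
}


def ancestors(term, is_a):
    """Return all classes reachable from term by following is_a links."""
    found = []
    stack = list(is_a.get(term, []))
    while stack:
        parent = stack.pop()
        if parent not in found:
            found.append(parent)
            stack.extend(is_a.get(parent, []))
    return found


if __name__ == "__main__":
    print(ancestors("neuroblast", IS_A))
```

With the ancestors in hand, a query for stem cell proliferation could in principle pick up annotations made to neuroblast proliferation, even though the GO itself only records that neuroblast proliferation is_a cell proliferation.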
36Even a light ontology like the GO is difficult enough
- A methodology that enforces clear, coherent definitions
- Promotes quality assurance
- Intent is not hard-coded into software
- Meaning of relationships is defined, not inferred
- Guarantees automatic reasoning across ontologies and across data at different granularities
- Consequences of inconsistencies
- Hard to synchronize manually
- Inconsistent user-search results
37Meeting the goal: Drawing inferences
[Figure: elements A, B, C, and D linked to proteins and publications by direct evidence in human (SP1234, SP8723, SP19345; PMID5555, PMID4444) and by indirect evidence in Xenopus (toad), Drosophila, and yeast (SP48392, SP48291, SP38921; PMID8976, PMID3924, PMID9550); a "?" marks the relation to be inferred.]
38Thank you
NCBO Reactome GO SO