Formalizing Taxonomy: A Status Report - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Formalizing Taxonomy: A Status Report


1
CleanTAX: A Framework and System for
Automated Reasoning with Taxonomies and
Articulations
Dave Thau, Sichen Bao and Bertram Ludäscher
Keywords: knowledge management, automatic
reasoning, semantic integration, biological
classification
2
Taxonomies are Everywhere
3
Taxonomies are Everywhere: Systematics
kingdom: Plantae
phylum: Tracheophyta
class: Magnoliopsida
order: Ranunculales
family: Ranunculaceae
genus: Ranunculus
species: Ranunculus asiaticus
4
Different Taxonomies Often Arise
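[Figure: the genus Anhinga as classified in
Clement's 4th Edition and in Clement's 5th
Edition. Species are attached to Anhinga by "is a"
edges; the species shown are Anhinga rufa, Anhinga
nova., and Anhinga melanogaster (which appears in
both editions). Articulations between the two
editions are labeled "contained in" or "?".]
Articulations by Santa Barbara Software Products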
5
Different Taxonomies Can Lead To Different Results
Predicted Distribution of Anhinga melanogaster
based on Clement's 4th Edition
Predicted Distribution of Anhinga melanogaster
based on Clement's 5th Edition
[Figure: one predicted distribution map per
edition, shown with the two Anhinga taxonomies
("is a" edges to Anhinga rufa, Anhinga nova., and
Anhinga melanogaster) and the articulations
between them, labeled "contained in" or "?".]
Articulations by Santa Barbara Software Products
Georeferenced observation data retrieved from the
Global Biodiversity Information Facility
(www.gbif.org). Distribution maps created using
the GARP niche modeling algorithm embedded in a
Kepler workflow.
6
Problem Statement
  • What are taxonomies, anyway?
  • How do you know a taxonomy makes sense?
  • Given some articulations meant to translate
    between taxonomies:
  • do they make sense, or are there internal
    contradictions?
  • have they left out anything that may be inferred
    logically?

7
What are Taxonomies?
  • A simple definition: a directed acyclic graph of
    nodes and edges, where the edges represent a
    "subtype" relation

[Figure: Anhinga rufa, Anhinga nova., and Anhinga
melanogaster, each joined to Anhinga by an "is a"
edge.]
Potential additional constraints:
  • children are disjoint (child-disjointness, D)
  • children together cover their parents
    (coverage, C)
  • nodes are non-empty (non-emptiness, N)

We call these "latent taxonomic assumptions" (LTAs)
  • More than one LTA may apply
  • There are 8 combinations:
    none, C, D, N, CD, CN, DN, CDN
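The count of 8 is the number of subsets of the three LTAs. A minimal sketch, assuming the single-letter codes from the slide; the function name `lta_combinations` is hypothetical and not taken from CleanTax:

```python
from itertools import combinations

# Single-letter codes as used on the slide.
LTA_NAMES = {"C": "coverage", "D": "child-disjointness", "N": "non-emptiness"}


def lta_combinations(codes=("C", "D", "N")):
    """Return every subset of the LTA codes, smallest subsets first."""
    subsets = []
    for size in range(len(codes) + 1):
        for combo in combinations(codes, size):
            # The empty subset means that no LTA is assumed.
            subsets.append("".join(combo) or "none")
    return subsets


print(lta_combinations())
# ['none', 'C', 'D', 'N', 'CD', 'CN', 'DN', 'CDN']
```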

8
Inconsistency in a Taxonomy
  • If a taxonomy adheres to the disjoint-children
    and non-emptiness latent taxonomic assumptions,
    the following is inconsistent (arrows are is-a
    relations)

[Figure: B and C each have an is-a arrow to A; D
has is-a arrows to both B and C.]
If B and C are children of A, then they must be
disjoint. However, D is non-empty and is a subtype
of both, so B and C share the elements of D.
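This kind of inconsistency can be checked mechanically. The sketch below is an illustration under stated assumptions, not the CleanTax implementation (which, according to the methodology slide, uses prover9 and mace4). It describes every possible individual by the set of taxa it belongs to, discards the membership patterns that violate an is-a, disjointness, or coverage constraint, and then asks whether every taxon can still have a member. The function name `consistent` and its parameters are hypothetical.

```python
from itertools import product


def consistent(taxa, isa, disjoint_children=False, coverage=False, non_empty=False):
    """Return True if some assignment of individuals to taxa satisfies everything.

    `isa` is a list of (child, parent) pairs.
    """
    children = {}
    for child, parent in isa:
        children.setdefault(parent, []).append(child)

    def allowed(member):
        # is-a: a member of the child is also a member of the parent.
        if any(member[c] and not member[p] for c, p in isa):
            return False
        for parent, kids in children.items():
            in_kids = sum(member[k] for k in kids)
            # child-disjointness: nobody belongs to two siblings.
            if disjoint_children and in_kids > 1:
                return False
            # coverage: a member of the parent belongs to some child.
            if coverage and member[parent] and in_kids == 0:
                return False
        return True

    patterns = [dict(zip(taxa, bits))
                for bits in product([False, True], repeat=len(taxa))]
    surviving = [m for m in patterns if allowed(m)]
    if not non_empty:
        # An individual that belongs to no taxon always survives.
        return True
    # non-emptiness: each taxon needs at least one surviving pattern.
    return all(any(m[t] for m in surviving) for t in taxa)


diamond = [("B", "A"), ("C", "A"), ("D", "B"), ("D", "C")]
print(consistent(["A", "B", "C", "D"], diamond,
                 disjoint_children=True, non_empty=True))  # False
```

Dropping either assumption makes the same graph consistent: without non-emptiness D can simply be empty, and without disjointness B and C may share D's members.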
9
How do Taxonomies Relate?
  • Articulations relate nodes between taxonomies

Between any two nodes in the taxonomies, one, and
only one, of the following five relations must
hold
(i) congruence: M ≡ N
(ii) proper inclusion: M > N
(iii) proper inverse inclusion: M < N
(iv) partial overlap: M o N
(v) exclusion: M x N
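If each taxon is read as the set of individuals it contains, the "one and only one" claim can be illustrated with ordinary set operations. This is a sketch only: the function name `basic_relation` and the return labels are hypothetical, and it assumes both sets are non-empty (for an empty set, inclusion and exclusion would both hold).

```python
def basic_relation(m, n):
    """Name the single basic relation holding between two non-empty sets."""
    m, n = set(m), set(n)
    if not m or not n:
        raise ValueError("both taxa are assumed to be non-empty")
    if m == n:
        return "congruent"
    if m > n:            # proper superset
        return "m properly includes n"
    if m < n:            # proper subset
        return "n properly includes m"
    if m & n:            # share some members, but neither contains the other
        return "partial overlap"
    return "exclusion"   # no members in common


print(basic_relation({"a", "b"}, {"a", "b"}))   # congruent
print(basic_relation({"a", "b"}, {"a"}))        # m properly includes n
print(basic_relation({"a", "b"}, {"b", "c"}))   # partial overlap
print(basic_relation({"a"}, {"c"}))             # exclusion
```

The branches are tested in order and are mutually exclusive for non-empty sets, so exactly one label is returned for any pair.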
10
The Relation Lattice
  • Sometimes, a single relation between two nodes
    cannot be proven.
  • The relation lattice shows all 32 possible
    combined relations.
  • Each node represents a disjunction of relations.
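The figure 32 is the number of subsets of five basic relations (2^5), each subset read as "one of these holds". A minimal sketch, assuming the relation names below as labels (the transcript preserves two of them only as the symbols "M o N" and "M x N"); `relation_lattice` and `combine` are hypothetical names:

```python
from itertools import combinations

# Labels are assumptions made for illustration.
BASIC_RELATIONS = (
    "congruence",
    "proper inclusion",
    "proper inverse inclusion",
    "partial overlap",
    "exclusion",
)


def relation_lattice(basic=BASIC_RELATIONS):
    """Every subset of the basic relations, each read as a disjunction."""
    return [frozenset(combo)
            for size in range(len(basic) + 1)
            for combo in combinations(basic, size)]


def combine(known_a, known_b):
    """Two disjunctions about the same pair of nodes narrow to their intersection."""
    return known_a & known_b


nodes = relation_lattice()
print(len(nodes))  # 32

narrowed = combine(frozenset({"congruence", "proper inclusion"}),
                   frozenset({"proper inclusion", "partial overlap"}))
print(sorted(narrowed))  # ['proper inclusion']
```

The full set of five says nothing is known about the pair; the empty set says no relation is possible, which signals an inconsistency.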

11
Articulations: Some Make Sense
[Figure: Taxonomy 1 has node A with children B and
C (isa edges). Taxonomy 2 has node D with children
E and F (isa edges).]
Articulations: A < D, C ≡ E, B < F
Assuming non-emptiness, disjoint children and
coverage LTAs
12
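Articulations: Some Are Impossible
[Figure: Taxonomy 1 has node A with children B and
C (isa edges). Taxonomy 2 has node D with children
E and F (isa edges).]
Articulations: C > F, B < F
Together these articulations place one of B and C
inside the other, which contradicts disjoint,
non-empty children.
Assuming non-emptiness, disjoint children and
coverage LTAs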
13
Articulations: Some Imply other Articulations
[Figure: Taxonomy 1 has node A with children B and
C (isa edges). Taxonomy 2 has node D with children
E and F (isa edges).]
Articulations: A ≡ D, C ≡ E
Implies B equals F
Assuming non-emptiness, disjoint children and
coverage LTAs
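The implication can be confirmed by brute force. The sketch below is an illustration, not the CleanTax method: it assumes the structure in which A has children B and C, D has children E and F, and A and C are congruent to D and E respectively. It lists every way an individual could belong to the six taxa, keeps the membership patterns that respect the is-a edges, the two congruences, child-disjointness, and coverage, and then checks that B and F agree in every surviving pattern. All names in the code are hypothetical.

```python
from itertools import product

TAXA = ["A", "B", "C", "D", "E", "F"]
ISA = [("B", "A"), ("C", "A"), ("E", "D"), ("F", "D")]   # (child, parent)
CONGRUENT = [("A", "D"), ("C", "E")]                     # given articulations


def allowed_patterns():
    """Membership patterns permitted by the taxonomies, LTAs and articulations."""
    children = {}
    for child, parent in ISA:
        children.setdefault(parent, []).append(child)

    result = []
    for bits in product([False, True], repeat=len(TAXA)):
        m = dict(zip(TAXA, bits))
        ok = all(m[p] or not m[c] for c, p in ISA)            # is-a
        ok = ok and all(m[x] == m[y] for x, y in CONGRUENT)   # congruence
        for parent, kids in children.items():
            in_kids = sum(m[k] for k in kids)
            ok = ok and in_kids <= 1                          # disjoint children
            ok = ok and (not m[parent] or in_kids >= 1)       # coverage
        if ok:
            result.append(m)
    return result


patterns = allowed_patterns()
# Non-emptiness: every taxon must still be able to have a member.
is_consistent = all(any(m[t] for m in patterns) for t in TAXA)
# B equals F is implied if no permitted individual tells them apart.
b_equals_f = all(m["B"] == m["F"] for m in patterns)
print(is_consistent, b_equals_f)  # True True
```

Only three patterns survive: an individual outside both taxonomies, one in A, C, D and E, and one in A, B, D and F. That is why B and F must coincide.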
14
The Complexity of Developing Articulations
The Ranunculus data set: 9 Taxonomies, 654 Taxa,
704 Articulations (visualization by Martin Graham)
  • How does the curator (articulator) know:
  • There are no contradictions
  • No relations have been left out
  • No ambiguity has been created

15
Logic Based Approach
  • Devise a language LTax
  • First-order logic constraints on single-place
    predicates, where each predicate is a "taxon"
  • Render taxonomies and articulations between them
    into a set of first-order formulas
  • Then we can ask:
  • does a taxonomy follow your definition of
    taxonomy?
  • is a pair of taxonomies plus articulations
    between them consistent?
  • are there unstated articulations?

16
Translating Taxonomy into Logic
Taxonomy and LTA Formulas
Articulation Formulas
Taxonomy and latent-taxonomic assumption rules
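The formulas on this slide did not survive in the transcript. As a rough illustration of what such a translation could look like, the sketch below renders is-a edges, the three LTAs, and a congruence articulation as first-order formulas over single-place predicates, written in Prover9-style syntax (`all`, `exists`, `->`, `<->`, `&`, `|`, `-`). The encoding and the function names are assumptions, not the authors' actual rules.

```python
def taxonomy_formulas(isa, coverage=False, disjoint=False, non_empty=False):
    """Render (child, parent) edges and the chosen LTAs as formula strings."""
    children = {}
    for child, parent in isa:
        children.setdefault(parent, []).append(child)

    # is-a: every member of the child is a member of the parent.
    formulas = [f"all x ({c}(x) -> {p}(x))." for c, p in isa]
    for parent, kids in children.items():
        if coverage:
            union = " | ".join(f"{k}(x)" for k in kids)
            formulas.append(f"all x ({parent}(x) -> ({union})).")
        if disjoint:
            for i, a in enumerate(kids):
                for b in kids[i + 1:]:
                    formulas.append(f"all x (-({a}(x) & {b}(x))).")
    if non_empty:
        taxa = sorted({t for edge in isa for t in edge})
        formulas += [f"exists x {t}(x)." for t in taxa]
    return formulas


def congruence_formula(m, n):
    """Articulation stating that taxa m and n have the same members."""
    return f"all x ({m}(x) <-> {n}(x))."


for line in taxonomy_formulas([("B", "A"), ("C", "A")],
                              coverage=True, disjoint=True, non_empty=True):
    print(line)
print(congruence_formula("A", "D"))
```

For input to a prover, such lines would typically be wrapped in a `formulas(assumptions).` ... `end_of_list.` block, with the relation to be proved placed in a `formulas(goals).` block.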
17
CleanTax Methodology
Given a set of taxonomies and articulations
between them:
  • Check each taxonomy under each LTA set to see if
    it's consistent
  • Check the articulations under each LTA set to see
    if they are consistent
  • Check the taxonomies plus the articulations under
    the LTA sets from above and make sure the
    combination is consistent
  • If so, for each pair-wise combination of nodes,
    try to prove each possible relationship under
    each consistent LTA set.

Implemented in Python. The theorem prover
Prover9 and the model searcher Mace4 are used
to prove relationships and check consistency.
18
Initial results
We ran two Ranunculus taxonomies (Benson 1948,
218 taxa, and Kartesz 2004, 142 taxa) and 206
articulations from Peet 2005.
When the taxonomies and the articulations were
analyzed as a whole, only two LTA combinations
were provably consistent: no LTAs, and
non-emptiness.
To get a better sense of the impact of LTAs, the
combined taxonomies and articulations were
divided into 82 connected subgraphs.
It took 2.5 hours to run the resulting 166,920
logical tests on a 40-node computer cluster.
Among these we found 5 inconsistencies and 1946
new articulations.
19
Discovered Inconsistent Mapping under the
coverage, disjointness, non-emptiness LTA set
[Figure: Benson, 1948 has Ranunculus
hydrocharoides with R.h. var natans, R.h. var
stolonifer, and R.h. var typicus. Kartesz, 2004
has Ranunculus hydrocharoides with R.h. var
stolonifer and R.h. var typicus. Three congruence
(≡) articulations join the two taxonomies.]
Peet, 2005:
  • B.1948:R.h.stolonifer is congruent to
    K.2004:R.h.stolonifer
  • B.1948:R.h.typicus is congruent to
    K.2004:R.h.typicus
  • B.1948:R. hydrocharoides is congruent to
    K.2004:R. hydrocharoides

The most likely fix here is to change the
congruence relation between the top two nodes to
instead state that Benson's R. hydrocharoides
includes Kartesz's. Benson's extra variety, R.h.
var natans, is non-empty and disjoint from its
siblings, so the two parents cannot be congruent
while the other two varieties are.
20
Formal Proof of Inconsistency
21
Inferring Additional Knowledge
Does C ≡ E? Or is C > E?
Benson, 1948
Kartesz, 2004
Key to nodes:
A  Ranunculus hispidus
B  R.h. var caricetorum
C  R.h. var hispidus
D  R.h. var nitidus
E  Ranunculus hispidus
F  R.h. var eurylobus
G  R.h. var greenmanii
H  R.h. var marilandicus
I  R.h. var typicus
J  R. septentrionalis
K  R. carolinanis
[Figure: nodes A to K joined by isa edges within
each taxonomy and by articulations marked < and ≡
between the taxonomies.]
Legend:
Taxonomy provided isa
Articulated Proper Inverse Inclusion
Articulated Congruence
22
Conclusions
  • Taxonomies are more complicated than you may have
    thought.
  • Logic is a useful tool for discovering
    inconsistencies and new relations in taxonomies
    and articulations between them.
  • This interdisciplinary line of research combines
    elements from systematics, artificial
    intelligence, and high-performance computing.

D. Thau and B. Ludäscher. Reasoning about
Taxonomies in First-Order Logic. Journal of
Ecological Informatics, (accepted pending minor
revisions).
23
  • Thanks!
  • Acknowledgements
  • Bertram Ludäscher, Shawn Bowers, Bob Peet, Martin
    Graham, Jessie Kennedy, Kirsten Menger-Anderson,
    and the SEEK project
  • Questions?

SEEK is supported by the National Science
Foundation under awards 0225676, 0225665,
0225635, and 0533368.
24
Future Work
  • Applications to other domains like proteins and
    genes (this is a plea)
  • Optimizing performance
  • Analyzing additional data sets
  • Creating reporting tools to support data curation
  • Is monadic logic enough?
  • Adding reasoning to articulation tools

25
Taxonomies are Everywhere: Protein Structure
From Ed Green,
http://compbio.berkeley.edu/people/ed/SeqCompEval/
26
Taxonomies are Everywhere: Phylogenies
From Thomas D. Als, Roger Vila, Nikolai P.
Kandul, David R. Nash, Shen-Horn Yen, Yu-Feng
Hsu, André A. Mignault, Jacobus J. Boomsma and
Naomi E. Pierce. Nature 432, 386-390.
27
Taxonomies are Useful, But Slippery
  • In all of these cases, taxonomies:
  • Help us organize information
  • Allow us to make inferences at many levels of
    generality
  • However, taxonomies are simply "views" of real
    data:
  • Dewey Decimal or Library of Congress?
  • Benson's view of Ranunculus or Kartesz's view?
  • Conflicting phylogenies are common
  • SCOP versus CATH

28
Applications
  • Assisting in metadata curation.
  • Add to articulation creation software
  • Articulators/curators can find logical
    inconsistencies earlier
  • Data discovery and integration
  • Eliminating incorrect articulations
  • Adding missing (logically inferred) articulations

29
Outline
  • Brief Overview of Taxonomies
  • Impact of Different Taxonomic Views on Data
    Analysis
  • Articulations Between Different Taxonomic Views
  • Using Logic to Determine Inconsistencies
  • Initial Results of Large Scale Analysis
  • Applications
  • Conclusions

30
Optimizations
LTA Optimization
The power set of the three LTAs (non-emptiness,
child disjointness, and coverage) results in an
8-node lattice. Nodes are conjunctions of LTAs. If
a set of axioms is inconsistent under one node,
it will be inconsistent under all the supersets
of that node.
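One way to exploit this property is to test LTA sets from smallest to largest and skip any set that contains an already-failed set. The sketch below is an assumption about how such pruning could be organized, not the CleanTax code; `check` stands for whatever consistency test is available (for example, a call to a model searcher), and all names are hypothetical.

```python
from itertools import combinations


def consistent_lta_sets(check, ltas=("C", "D", "N")):
    """Test LTA subsets smallest-first, skipping supersets of any failure.

    `check` takes a frozenset of LTA codes and returns True when the
    axioms are consistent under those assumptions.
    Returns (consistent_sets, skipped_sets).
    """
    failed, consistent, skipped = [], [], []
    for size in range(len(ltas) + 1):
        for combo in combinations(ltas, size):
            node = frozenset(combo)
            # Inconsistency is inherited by every superset of a failed node.
            if any(bad <= node for bad in failed):
                skipped.append(node)
                continue
            if check(node):
                consistent.append(node)
            else:
                failed.append(node)
    return consistent, skipped


# Hypothetical stand-in for a real test: only "no LTAs" and
# "non-emptiness alone" are consistent.
calls = []


def fake_check(node):
    calls.append(node)
    return node <= {"N"}


ok, skipped = consistent_lta_sets(fake_check)
print([sorted(s) for s in ok])   # [[], ['N']]
print(len(calls), len(skipped))  # 4 4
```

With this stand-in, C and D fail on their own, so CD, CN, DN and CDN are never tested: four checks instead of eight.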
31
Optimizations: Relation Lattice
The answers to the light blue nodes in the
lattice determine the answers to the rest of the
nodes.
32
Optimizations: RCC-5
33
Formal Proof of Inconsistency
34
Classifications are Slippery
Adapted from R. Peet
[Figure: taxa shown are Ranunculus plumosa (twice),
R.plumosa var intermedia, R.plumosa var plumosa,
and Ranunculus pinetcola.]