Title: Science Environment
1Science Environment for Ecological Knowledge
Jessie Kennedy, School of Computing, Napier University, Edinburgh
2The SEEK Prototype: Ecological Niche Modeling
Diagram labels: Geographic Space, Ecological Space
- Biodiversity information (e.g. data from museum specimens, ecological surveys): occurrence points on native distribution
- Geospatial and remotely sensed data
- Ecological niche modeling
- Native range prediction
- Results taken to integrate with other data realms (e.g., human populations, public health, etc.)
3Species prediction map
Predicted distribution: Amur snakehead (Channa argus)
Image from http://www.lifemapper.org
4SEEK Overview
- Semantic Mediation System
- Smart data discovery and integration
- Analysis and Modelling System (Kepler)
- Modelling scientific workflows
- EcoGrid
- Making diverse environmental data systems interoperate
- Taxon WG
- Taxonomic name/concept resolution server
5Scientific workflows
- EML provides semi-automated data binding
- Scientific workflows represent knowledge about the process
- AMS captures this knowledge
6Kepler Ecological Niche Model
7Metadata driven data ingestion
- Key information needed to read and machine-process a data file is in the metadata
- Physical descriptors (CSV, Excel, RDBMS, etc.)
- Logical Entity (table, image, etc.) and Attribute (column) descriptions
- Name
- Type (integer, float, string, etc.)
- Codes (missing values, nulls, etc.)
- Integrity constraints
- Semantic descriptions (ontology-based type systems)
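A minimal sketch of the idea on this slide, assuming a hypothetical metadata dictionary rather than real EML (which is an XML document with far more detail). The names METADATA, CASTS and ingest are illustrative only. It shows how the physical descriptors, attribute names, types and missing-value codes can be enough to read a data file without hand-written parsing.

```python
import csv
import io

# Hypothetical metadata record (an assumption for illustration, not actual EML).
METADATA = {
    "physical": {"format": "CSV", "delimiter": ",", "header_lines": 1},
    "attributes": [
        {"name": "site", "type": "string", "missing": []},
        {"name": "count", "type": "integer", "missing": ["-999"]},
        {"name": "biomass_g", "type": "float", "missing": ["NA", ""]},
    ],
}

# Map declared attribute types to Python conversion functions.
CASTS = {"string": str, "integer": int, "float": float}


def ingest(text, metadata):
    """Parse delimited text into typed rows using only what the metadata states."""
    physical = metadata["physical"]
    reader = csv.reader(io.StringIO(text), delimiter=physical["delimiter"])
    rows = list(reader)[physical["header_lines"]:]  # skip declared header lines
    records = []
    for row in rows:
        record = {}
        for attr, raw in zip(metadata["attributes"], row):
            if raw in attr["missing"]:
                record[attr["name"]] = None  # declared missing-value code
            else:
                record[attr["name"]] = CASTS[attr["type"]](raw)
        records.append(record)
    return records


SAMPLE = "site,count,biomass_g\nA1,12,3.5\nB2,-999,NA\n"
records = ingest(SAMPLE, METADATA)
```

Integrity constraints and semantic descriptions are left out of this sketch; they would be further keys on each attribute entry.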
8Ecological ontologies
- What was measured (e.g., biomass)
- Type of quantity measured (e.g., Energy)
- Context of measurement (e.g., Psychotria limonensis)
- How it was measured (e.g., dry weight)
9Semantic Mediation
- Label data with semantic types
- Label inputs and outputs of analytical components with semantic types
- Use reasoning engines to generate transformation steps
- Use reasoning engine to discover relevant components
Diagram labels: Data, Ontology, Workflow Components
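The component-discovery step can be sketched as a subsumption check over a toy ontology. Everything named here (ONTOLOGY, COMPONENTS, the type names, is_a, discover_components) is a hypothetical stand-in, not SEEK's ontology or reasoning engine, and the sketch covers discovery only, not the generation of transformation steps.

```python
# Hypothetical toy ontology: each semantic type maps to its parent type.
ONTOLOGY = {
    "DryWeightBiomass": "Biomass",
    "Biomass": "Measurement",
    "SpeciesOccurrence": "Observation",
}


def is_a(semantic_type, ancestor):
    """Return True if semantic_type equals ancestor or descends from it."""
    current = semantic_type
    while current is not None:
        if current == ancestor:
            return True
        current = ONTOLOGY.get(current)  # None once we pass the root
    return False


# Hypothetical workflow components labelled with input/output semantic types.
COMPONENTS = {
    "BiomassSummary": {"input": "Biomass", "output": "Measurement"},
    "NicheModel": {"input": "SpeciesOccurrence", "output": "RangePrediction"},
}


def discover_components(data_type):
    """List components whose declared input type subsumes the data's type."""
    return sorted(
        name for name, sig in COMPONENTS.items() if is_a(data_type, sig["input"])
    )


matches = discover_components("DryWeightBiomass")
```

A dataset labelled with the more specific type is accepted by a component declared on the more general type, which is the point of labelling both sides.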
10Data integration
- Homogeneous data integration
- Integration of homogeneous data via EML metadata is relatively straightforward
- Heterogeneous data integration
- Requires advanced metadata and processing
- Attributes must be semantically typed
- Collection protocols must be known
- Units and measurement scale must be known
- Measurement relationships must be known
- e.g., that ArealDensity = Count/Area
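Reading the example above as ArealDensity = Count / Area, a small sketch shows why units and measurement relationships must be known before two datasets can be compared. The Observation class, the unit labels and the conversion table are assumptions made for illustration.

```python
from dataclasses import dataclass

# Conversion factors to square metres (hypothetical unit labels).
AREA_TO_M2 = {"m2": 1.0, "ha": 10_000.0, "km2": 1_000_000.0}


@dataclass
class Observation:
    count: int
    area: float
    area_unit: str


def areal_density_per_m2(obs):
    """ArealDensity = Count / Area, with the area normalised to square metres."""
    return obs.count / (obs.area * AREA_TO_M2[obs.area_unit])


# Two plots recorded with different area units.
plot_a = Observation(count=50, area=100.0, area_unit="m2")
plot_b = Observation(count=2500, area=0.5, area_unit="ha")

density_a = areal_density_per_m2(plot_a)
density_b = areal_density_per_m2(plot_b)
```

The raw counts differ by a factor of fifty, yet both plots have the same density once the relationship and the units are applied.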
11Life Sciences Data
- Much of the data gathered in ecological studies and used in ecological data analysis is bio-referenced data
- typically organisms are referenced by a Latin name
- Many analyses require integrating data originating in many locations and at various points in time
- for most bio-referenced data, integration involves matching on organism name
12Biological (scientific) Names
- Used for communicating information about known organisms and groups of organisms (taxa)
- Framework for all biologists to communicate with
- Taxonomists apply scientific names to species and higher taxa in their classifications
- Formalized and validated according to strict codes of nomenclature
- (different depending on kingdom)
- Latin name is a polynomial for species and below, a monomial for genus and above
- Quoted as: LatinName NameAuthors Year
- Example: Carya floridana Sarg. 1913
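The quoting convention above (Latin name, name authors, year) can be sketched as a small parser. This is a simplified illustration under stated assumptions: the pattern and the names NAME_PATTERN, ParsedName and parse_scientific_name are hypothetical, and it does not handle authors that begin in lower case, hybrid markers, or other cases the nomenclatural codes allow.

```python
import re
from typing import NamedTuple, Optional

# Simplified pattern: capitalised first word, optional lower-case epithets,
# then optional authors and an optional four-digit year.
NAME_PATTERN = re.compile(
    r"(?P<latin>[A-Z][a-z-]+(?: [a-z-]+)*)"
    r"(?: (?P<authors>[^\d\s].*?))?"
    r"(?: (?P<year>\d{4}))?"
)


class ParsedName(NamedTuple):
    latin_name: str
    authors: Optional[str]
    year: Optional[int]
    form: str  # "monomial" (one word) or "polynomial" (several words)


def parse_scientific_name(text):
    """Split a quoted name into Latin name, authors and year, or return None."""
    match = NAME_PATTERN.fullmatch(text.strip())
    if match is None:
        return None
    latin = match.group("latin")
    year = match.group("year")
    form = "polynomial" if " " in latin else "monomial"
    return ParsedName(latin, match.group("authors"), int(year) if year else None, form)


parsed = parse_scientific_name("Carya floridana Sarg. 1913")
```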
13Classification, Concepts & Names
14Classification, Concepts & Names
15Taxonomic history of Aus L. 1758
bea and cea noted as invalid names and replaced
with beus and ceus. Pyle 1990
16Problems with Scientific Names
- Often recorded inappropriately in datasets
- No author and/or year (e.g. Carya floridana)
- Abbreviated (e.g. C. floridana)
- Internal code (e.g. PicRub for Picea rubens)
- Vernacular used (e.g. Scrub Hickory)
- Misspelled
- Are not unique
- Re-use of names with changed definition
- Name is ambiguous without definition
- Subject to name alterations and 'corrections' over time
- (e.g. Code changes its rules)
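Some of the recording problems listed on this slide can be flagged with simple heuristics. The sketch below is an assumption-laden illustration: the function name name_problems and its rules are hypothetical, and it does not attempt to detect vernacular names or misspellings, which would need a reference dictionary.

```python
import re


def name_problems(recorded):
    """Return heuristic flags for a recorded name string (illustrative only)."""
    tokens = recorded.split()
    if not tokens:
        return ["empty"]
    # A single token made of two or more capitalised fragments, e.g. "PicRub".
    if len(tokens) == 1 and re.fullmatch(r"(?:[A-Z][a-z]+){2,}", recorded):
        return ["internal code"]
    problems = []
    # Genus reduced to an initial, e.g. "C. floridana".
    if re.fullmatch(r"[A-Z]\.", tokens[0]):
        problems.append("abbreviated genus")
    # Authors are assumed to start with a capital letter or a bracket.
    if not any(t[0].isupper() or t[0] == "(" for t in tokens[1:]):
        problems.append("no author")
    # A year is assumed to be a four-digit token, optionally bracketed.
    if not any(re.fullmatch(r"\(?\d{4}\)?", t) for t in tokens):
        problems.append("no year")
    return problems


flags = name_problems("C. floridana")
```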
17Concepts
- Full Scientific name according to (Author, Publication, Date), with Definition
- Carya floridana Sarg. (1913) according to Charles Sprague Sargent, Trees & Shrubs 2:193, plate 177 (1913), with Definition
- Original concept
- 1st use of name as described by the taxonomist
- same author and date in the scientific name and the "according to"
- same publication for original concept and name
- Revised concept
- Re-classification of a group
- different author and date in the "according to"
- Carya floridana Sarg. (1913) according to Stone, FNA 3:424 (1997), with Definition
- Should be used for communicating about groups of organisms
- Full Scientific name according to (Author, Publication, Date)
- definition is clear; can get the definition
- comparing or integrating data based on concepts is more accurate
- Can GUIDs help?
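The distinction between original and revised concepts can be sketched as a small data structure. The class and field names are hypothetical, not taken from any SEEK schema, and the sketch assumes the name author and the "according to" author are recorded in the same abbreviated form so they can be compared directly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScientificName:
    latin_name: str
    author: str
    year: int


@dataclass(frozen=True)
class TaxonConcept:
    name: ScientificName
    according_to_author: str
    according_to_publication: str
    according_to_year: int

    def is_original(self):
        """Original concept: same author and date in the name and the 'according to'."""
        return (
            self.name.author == self.according_to_author
            and self.name.year == self.according_to_year
        )

    def label(self):
        """User-readable key: full scientific name plus the 'according to' part."""
        n = self.name
        return (
            f"{n.latin_name} {n.author} ({n.year}) according to "
            f"{self.according_to_author}, {self.according_to_publication} "
            f"({self.according_to_year})"
        )


carya = ScientificName("Carya floridana", "Sarg.", 1913)
original = TaxonConcept(carya, "Sarg.", "Trees & Shrubs 2:193", 1913)
revised = TaxonConcept(carya, "Stone", "FNA 3:424", 1997)
```

Both concepts share one name, so a dataset that records only the name cannot say which of the two definitions was meant.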
18Concepts
- Concepts are described in many ways
- Created by someone: an Author
- Described in a Publication
- Given a Name
- May or may not be valid in terms of the nomenclatural codes
- Depending on the taxonomist's working practice, defined by
- the set of Specimens examined
- (type specimens and others)
- Common set of Characters
- data recorded by taxonomists to describe specimens and taxa
- context dependent: differentiate taxa rather than fully describe them
- use natural language with all its ambiguities
- Relationships to other Taxon Concepts
- Taxon circumscription
- the lower level taxa
- Congruence, overlap etc. to taxa in other classifications
19Legacy Data
- In legacy data, names often appear in place of concepts
- Names are imprecise
- are inappropriate for referring to information regarding a taxon, e.g. observational/collection data
- BUT sometimes that's all we have
- How do we interpret names?
- potentially multiple definitions
- the sum of all definitions that exist for the name
- would that make any sense? conflicts?
- one of the existing definitions
- how can we choose?
- the attributes in common to all the definitions
- would that leave any?
- represented by the type specimen
- but what does that mean? very subjective
20Legacy Names as Concepts
- Nominal concepts
- Sub-set of TaxonConcepts
- Name but no AccordingTo
- non-unique (concept) identifier elements
- can have a unique concept GUID
- No definition
- Explicitly saying it's something with this name, but not really sure what is/was meant
- Encourage people to understand and address the issue of names
- Allowing mark-up of collections with names allows people to believe names are really good enough
- Important problem: needs to be tackled sooner rather than later
- will improve long-term usefulness of scientific data
- ease integration
21SEEK Taxon
- Build a Name/Concept resolution server
- TOS (Kansas)
- Taxonomic Concept Schema
- TCS (Napier)
- Exchange of taxonomic Info
- TDWG/GBIF standard
- Basis for TOS
- GUIDs
- GBIF/SEEK etc.
- Tools to relate and compare concepts
- Taxonomy Comparison Visualisation Tool (Napier)
- Concept Mapper Tool (UNC)
22Concept Comparison Visualisation
23Taxon Concept Schema
- TCS developed to allow exchange of taxonomic names/concept data
- Based on consultation with a range of users
- understand users' notions of taxonomic concept
- what information they consider part of a concept
- Presentations at meetings, including 2 TDWG
- Agreement that concepts are important and necessary
- Taxon Names are independent from Taxon Concepts
- Agreement that observations/identifications etc. should record concepts, not names
24TCS
- XML based exchange schema
- Not designed as the correct way to model a Taxon Concept
- No rules as to what a taxon must have
- certain things needed to be useful
- Designed to accommodate the different ways concepts are described
- Lots of optionality or flexibility in elements
- to address different work practices in the community
- Includes Taxon Names
- are more constrained as they are governed by the codes of nomenclature
25TCS
- Considerable debate on what should be top level elements
- Related closely to the question
- What gets a GUID?
- Taxon concepts
- Taxon Names
- Specimens
- Publications
- Taxon Relationship Assertions
- Concepts refer to Names
- Names must not change
- Can't record original taxon concept
26Exchange of Data
- Exchange of definitional data
- name definition
- information on history of name and type specimen and publication details
- taxon concept definition
- Name, publication details for the defining source, characters, specimens, related taxa etc.
- Exchange of usage data
- for observations/lists (should only use taxon concepts)
- need only exchange references to existing taxon concepts
- user readable keys, e.g. Full Scientific name according to Author Publication
- GUIDs
- for name checking purposes
- need only exchange name without history or typification
- user readable keys, e.g. Full Scientific name
- GUIDs
27Issues of GUIDs for integration
- What gets a GUID?
- TCS top level elements?
- The physical thing or the electronic record of the thing
- What is data and what is metadata associated with the GUID?
- Depends on your perspective on life
- Stability of data associated with a GUID
- Who issues GUIDs?
- Centralised authority of some sort (peer review?)
- One GUID per concept or name (no duplicates)
- ensure business rules are applied to new names/concepts created
- - bottleneck?
- - too restrictive in what the business rules might be
- Distributed free-for-all
- Anyone can publish their own name/concept and get a GUID
- - Mess of GUIDs to sort out
- Which technology?
- LSIDs, DOI etc.
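The trade-off between the two issuing models can be sketched in a few lines. This is an illustration under stated assumptions: Python's uuid module stands in for whatever identifier technology is chosen (it is not an LSID or DOI implementation), and CentralRegistry and distributed_guid are hypothetical names.

```python
import uuid


class CentralRegistry:
    """Centralised model: one GUID per concept or name key, no duplicates."""

    def __init__(self):
        self._by_key = {}

    def register(self, key):
        # A trivial stand-in for business rules: normalise case and whitespace.
        normalised = " ".join(key.split()).lower()
        if normalised not in self._by_key:
            self._by_key[normalised] = str(uuid.uuid4())
        return self._by_key[normalised]


def distributed_guid():
    """Distributed model: any publisher mints a fresh GUID independently."""
    return str(uuid.uuid4())


registry = CentralRegistry()
key = "Carya floridana Sarg. (1913) according to Stone (1997)"
central_first = registry.register(key)
central_second = registry.register("  carya floridana  Sarg. (1913) according to Stone (1997)")

# Two publishers describing the same concept end up with different GUIDs.
publisher_a = distributed_guid()
publisher_b = distributed_guid()
```

The registry returns the same identifier for both submissions, at the cost of every request passing through one authority; the distributed version never blocks but leaves duplicates to reconcile later.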
28TCS and SEEK and
- Taxon Object Server
- Core of concept/name resolution service
- Kansas team has been implementing the TOS
- Schema based on the TCS model
- Tool to import data from TCS documents
- EML
- Proposed modifications to EML to accommodate SEEK's taxonomic resolution services in the future
- User interface tools
- Uses cut-down TCS as input format
- Inform other biology metadata standards on taxonomic issues
- Cataloguing the complete genome standard
29Taxonomic Object Server
- TOS allows
- registration, retrieval, integration of datasets
- Matches concepts given names, other concepts and taxonomies
- Allows taxonomists to
- Author new ideas
- Make new relationships between concepts
- Allows researchers to
- Easily see previous taxonomic opinions
- Use a stable identification system to reference concepts (LSIDs)
- Find concepts
- Integration with Kepler
30TOS operations
- Via TCS document
- addConcept
- addRelationship
- Public APIs
- getConcept on GUID
- getBestConcept on name string
- getHigherTaxon on GUID and authority up tree
- getAuthoritativeList down tree
- findConcepts on any property(s)
- findRelatedConcepts on GUID and relationships
- getSynonymousNames returns name strings
- getHigherTaxon
- getAuthoritativeList
- Dictionary for name-concept matching
- N-gram matching algorithm
- getBestConcept
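The slide names an n-gram matching algorithm behind getBestConcept but gives no details, so the following is only a sketch of one common approach: character trigrams compared with the Dice coefficient. The dictionary contents, the placeholder identifiers, the threshold and the function names are all assumptions, not the TOS implementation.

```python
def ngrams(text, n=3):
    """Return the set of character n-grams of text, padded with spaces."""
    padded = f"{' ' * (n - 1)}{text.lower()}{' ' * (n - 1)}"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def similarity(a, b, n=3):
    """Dice coefficient over the two strings' n-gram sets (0.0 to 1.0)."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


# Hypothetical name-to-concept dictionary with placeholder identifiers.
DICTIONARY = {
    "Carya floridana": "guid:example-1",
    "Picea rubens": "guid:example-2",
    "Channa argus": "guid:example-3",
}


def get_best_concept(name_string, threshold=0.5):
    """Return the identifier of the closest known name, or None if too dissimilar."""
    best_name = max(DICTIONARY, key=lambda known: similarity(name_string, known))
    if similarity(name_string, best_name) < threshold:
        return None
    return DICTIONARY[best_name]


best = get_best_concept("Carya floridanna")  # misspelled input
```

Matching on n-grams tolerates small misspellings in recorded names, while the threshold keeps unrelated names from being forced onto the nearest dictionary entry.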