Title: Ontologies at Your Service
1 Ontologies at Your Service
- Yigal Arens
- Eduard Hovy
- USC/ISI
2 DGRC
- Purpose: Make Digital Government Happen!
- Advance information systems research
- Bring the benefits of cutting-edge IT research to
government systems
- Help educate government and the community
3 The problem and the solution
- Problem: FedStats brings together thousands of
databases from over seventy Federal agencies
- data is duplicated and near-duplicated,
- even government personnel have trouble finding
and interpreting one another's data!
Research challenge: Provide access to multiple
databases, for both sophisticated and casual
users, in an easy-to-use and easy-to-understand
manner, without distorting the data
- Solution: Create a framework that can provide
easy, fast, and/or standardized access
- need method of standardizing many databases,
- need multi-database access engine,
- need powerful user interface.
4 Why use an ontology?
Ontology: a taxonomized set of terms with
definitions and axioms, used by humans,
databases, and systems.
- Cognitive Reasons
- Investigate human knowledge organization.
- Build platform for human processes: reminding,
generalization, learning.
- System Building Reasons
- Standardize terminology: avoid inconsistency.
- Assist knowledge transfer: link data across
domains.
- Facilitate interoperability: let systems work in
new domains.
5 SENSUS: two uses
- Plug in domain models and databases: multi-DB
access
- Link to words of different languages: translation
[Figure: SENSUS linked to words of several languages: lobster, Buch, tournee, Klavier, livre, Käse, cheese, bench, fromage]
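The translation use can be sketched as a lookup that goes from a word to a language-neutral concept and back out to a word of the target language. This is only a minimal illustration, not SENSUS itself: the `LEXICON` table, the concept identifier `CHEESE`, the language codes, and the `translate` function are assumptions made for the example; only the three words come from the slide.

```python
from typing import Optional

# Hypothetical mini-lexicon: each (language, word) pair points at one
# language-neutral concept identifier. Not taken from SENSUS.
LEXICON = {
    ("en", "cheese"): "CHEESE",
    ("de", "Käse"): "CHEESE",
    ("fr", "fromage"): "CHEESE",
}


def translate(word: str, source_lang: str, target_lang: str) -> Optional[str]:
    """Translate by pivoting through the shared concept, if one exists."""
    concept = LEXICON.get((source_lang, word))
    if concept is None:
        return None  # the word is not linked to any concept
    for (lang, candidate), linked in LEXICON.items():
        if lang == target_lang and linked == concept:
            return candidate
    return None  # no word for this concept in the target language


print(translate("Käse", "de", "fr"))
```

Because every language links to the same concept, adding a new language needs one set of word-to-concept links rather than a dictionary for each language pair.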
6 ISI's DINO ontology
http://edc.isi.edu:8011/dino
- Taxonomy, multiple superclass links
- Approx. 90,000 concepts
- Top level: Penman Upper Model (ISI)
- Body: WordNet 1.6 (Princeton), rearranged
- New information being added by text mining
- Used at ISI for machine translation, text
summarization, database access
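A taxonomy with multiple superclass links is a directed acyclic graph rather than a tree: a concept may sit under more than one parent. The sketch below shows one simple way to store such links and collect all ancestors of a concept. The concept names and the `SUPERCLASSES`, `ancestors`, and `is_a` names are hypothetical and are not taken from the ontology described here.

```python
from typing import Dict, List, Set

# Hypothetical fragment: each concept lists its direct superclasses.
# "cheese" has two parents, which a plain tree could not express.
SUPERCLASSES: Dict[str, List[str]] = {
    "cheese": ["dairy-product", "solid-food"],
    "dairy-product": ["food"],
    "solid-food": ["food"],
    "food": ["object"],
    "object": [],
}


def ancestors(concept: str, links: Dict[str, List[str]] = SUPERCLASSES) -> Set[str]:
    """Collect every concept reachable by following superclass links upward."""
    seen: Set[str] = set()
    stack = list(links.get(concept, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:  # shared ancestors are visited only once
            seen.add(parent)
            stack.extend(links.get(parent, []))
    return seen


def is_a(concept: str, candidate: str) -> bool:
    """True if `candidate` is the concept itself or one of its ancestors."""
    return concept == candidate or candidate in ancestors(concept)


print(sorted(ancestors("cheese")))
```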
7 Projects Described Today
- Energy Data Collection (EDC)
- Access to distributed statistical data
8 1. EDC Project
9 EDC: Access to gasoline data
- Government partners
- Energy Information Administration (EIA)
- Bureau of Labor Statistics (BLS)
- Census Bureau
- (also data from California Energy Commission)
- Central problems attacked
- Proliferation of terminology
- Difficulty requesting and interpreting data
- Need to integrate data from autonomous sources
- Current databases and models
- SENSUS ontology: 90,000 nodes (from ISI's NLP
technology)
- Domain model: 500 nodes (manual, for database
access planner)
- LKB: 6000 nodes (NL term/info extraction from
glossaries)
- Databases: 58,000 series (EIA OGIRS and others)
- Webpages: 60 (BLS, CEC tables)
10 The idea behind SIMS
- There are many types of data sources: databases,
PDF files, text files, HTML files...
- The user doesn't want to know this!
- Solution:
1. Wrap each source in software that handles
access to its data
2. Record the types of info in each source in a
source model
3. Arrange all source models together in the same
space: the Domain Model
4. Use a data access planner (SIMS) to transform a
user's request for data into a set of individual
access queries that extract the right data from
the appropriate sources
[Figure labels: Sources, Models]
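The four steps can be sketched in a few lines of code. This is a toy illustration under stated assumptions, not the SIMS planner: the `Wrapper` class, the attribute names, the source names `price_db` and `volume_page`, and the greedy "first source that provides it" rule are all invented for the example.

```python
from typing import Callable, Dict, List, Set, Tuple

Row = Dict[str, object]


class Wrapper:
    """Step 1: hide how a source is stored. Step 2: declare what it holds."""

    def __init__(self, name: str, attributes: Set[str], fetch: Callable[[], List[Row]]):
        self.name = name
        self.attributes = attributes  # the source model: kinds of info offered
        self.fetch = fetch            # source-specific access code


def build_domain_model(wrappers: List[Wrapper]) -> Dict[str, List[Wrapper]]:
    """Step 3: place all source models in one space, indexed by attribute."""
    model: Dict[str, List[Wrapper]] = {}
    for wrapper in wrappers:
        for attribute in sorted(wrapper.attributes):
            model.setdefault(attribute, []).append(wrapper)
    return model


def plan(requested: Set[str], model: Dict[str, List[Wrapper]]) -> List[Tuple[str, Set[str]]]:
    """Step 4 (greatly simplified): one access query per source that is needed."""
    queries: Dict[str, Set[str]] = {}
    for attribute in sorted(requested):
        sources = model.get(attribute)
        if not sources:
            raise KeyError(f"no source provides {attribute!r}")
        # Assumption: take the first source offering the attribute.
        queries.setdefault(sources[0].name, set()).add(attribute)
    return sorted(queries.items(), key=lambda item: item[0])


# Hypothetical sources: a database and a web page, each behind a wrapper.
price_db = Wrapper("price_db", {"region", "gasoline_price"},
                   lambda: [{"region": "CA", "gasoline_price": 1.5}])
volume_page = Wrapper("volume_page", {"region", "gasoline_volume"},
                      lambda: [{"region": "CA", "gasoline_volume": 100}])

domain_model = build_domain_model([price_db, volume_page])
print(plan({"gasoline_price", "gasoline_volume"}, domain_model))
```

The user asks only for attributes; which source answers each part of the request is decided by the planner from the domain model.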
11 A super domain model: the ontology
http://edc.isi.edu:8011/dino
12 Extracting metadata from text
- Challenge: Extend the ontology to cover domain
models. Try doing this automatically, by
extracting useful terms from text associated with
data
- Problems
- Proliferation of terms in domain
- Agencies define terms differently
- Many terms refer to the same or related entity
- Lengthy term definitions often bury important
information
- Example input
- Motor Gasoline Blending Components: Naphthas
(e.g., straight-run gasoline, alkylate,
reformate, benzene, toluene, xylene) used for
blending or compounding into finished motor
gasoline. These components include reformulated
gasoline blendstock for oxygenate blending (RBOB)
but exclude oxygenates (alcohols, ethers),
butane, and pentanes plus. Note: Oxygenates are
reported as individual components and are
included in the total for other hydrocarbons,
hydrogens, and oxygenates.
Judith Klavans, Dir. of CRIA, Columbia
Deniz Saros, grad student, Columbia
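To make the extraction task concrete, here is a small pattern-based sketch that pulls a few pieces of metadata out of the example glossary entry. It is not the method used in the project: it assumes the headword is separated from its definition by a colon, that examples appear in a parenthesis starting with "e.g.,", and that acronyms are all-capital tokens in parentheses. The names `ENTRY` and `parse_entry` are hypothetical.

```python
import re
from typing import Dict, List, Union

# The example entry from the slide, shortened to its first two sentences.
ENTRY = (
    "Motor Gasoline Blending Components: Naphthas (e.g., straight-run "
    "gasoline, alkylate, reformate, benzene, toluene, xylene) used for "
    "blending or compounding into finished motor gasoline. These "
    "components include reformulated gasoline blendstock for oxygenate "
    "blending (RBOB) but exclude oxygenates (alcohols, ethers), butane, "
    "and pentanes plus."
)


def parse_entry(entry: str) -> Dict[str, Union[str, List[str]]]:
    """Split a glossary entry into term, listed examples, and acronyms."""
    # Assumption: the first colon separates the headword from its definition.
    headword, _, definition = entry.partition(":")
    definition = definition.strip()

    # Assumption: examples are given as "(e.g., a, b, c)".
    match = re.search(r"\(e\.g\.,\s*([^)]*)\)", definition)
    examples = [item.strip() for item in match.group(1).split(",")] if match else []

    # Assumption: an acronym is two or more capitals alone in parentheses.
    acronyms = re.findall(r"\(([A-Z]{2,})\)", definition)

    return {"term": headword.strip(), "examples": examples, "acronyms": acronyms}


parsed = parse_entry(ENTRY)
print(parsed["term"])
print(parsed["examples"])
print(parsed["acronyms"])
```

Even this crude pass surfaces information that the long definition buries, such as the listed example substances and the acronym RBOB.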
13 Lexical Knowledge Base (LKB) Tool
- Combines statistical and linguistic methods
- identifies topics with high accuracy
- provides complete coverage
- useful for any subject area
- produced over 6,000 concepts in current domain
14 2. A Biology Polyclave
15 Polyclave
- Challenge: at NSF Biodiversity Infrastructure
workshop, field biologists asked for hand-held
polyclave
- need to identify plant species in the field,
tramping through Colorado
- workers sometimes not fully expert in tens of
thousands of varieties
- existing polyclave built by someone, but
proprietary and not hand-held
- Experiment: we used ISI knowledge rep (ontology)
technology to build a polyclave and populated it
with information from UC Davis; see
http://vigor.isi.edu:8888/
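A polyclave is a multi-access identification key: the user supplies whichever characters can be observed, in any order, and the key returns the species consistent with them. The sketch below shows that filtering idea only. The species names, characters, and values are placeholders, not data from UC Davis, and `candidates` is a hypothetical function rather than part of the system described here.

```python
from typing import Dict, List

# Placeholder character data: species name -> observable characters.
SPECIES: Dict[str, Dict[str, str]] = {
    "species A": {"flower color": "yellow", "leaf arrangement": "alternate", "petals": "5"},
    "species B": {"flower color": "yellow", "leaf arrangement": "opposite", "petals": "4"},
    "species C": {"flower color": "blue", "leaf arrangement": "alternate", "petals": "5"},
}


def candidates(observations: Dict[str, str],
               species: Dict[str, Dict[str, str]] = SPECIES) -> List[str]:
    """Return the species whose characters match every observation given."""
    return sorted(
        name
        for name, characters in species.items()
        if all(characters.get(key) == value for key, value in observations.items())
    )


# Observations can be given in any order and narrowed step by step.
print(candidates({"flower color": "yellow"}))
print(candidates({"petals": "5", "flower color": "yellow"}))
```

Unlike a fixed dichotomous key, nothing forces the user to answer questions in a set order, which suits field work where some characters cannot be observed.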
18 Current ontology research directions
- Applications
- DGRC: modeling and linking health data from NCHS
to SENSUS
- Automated QA: using SENSUS information in
Webclopedia, ISI's QA system (like AskJeeves, for
real)
- Ontology construction research
- Investigating methods for automated ontology
construction, using statistical clustering
methods
- Investigating methods for automated ontology
content acquisition, by extracting information
from online text
19 Thank you! Any questions?