Title: Concepts, Ontologies, and Project TANGO
1Concepts, Ontologies, and Project TANGO
- Deryle Lonsdale
- BYU Linguistics and English Language
- lonz_at_byu.edu
2Outline
- NSF projects
- Semantic Web
- Concepts
- Project TIDIE
- Ontologies
- Project TANGO
- Tables
- Ontology generation
3Acknowledgements
- NSF
- David Embley (BYU CS), Steve Liddle (BYU Marriott School) and Yuri Tijerino
- BYU Data Extraction Group members
4The National Science Foundation
- Federal agency, $5.5 billion budget, funds 20% of all federally supported basic research conducted by America's colleges and universities
- 7 directorates
- Biological Sciences, Computer and Information Science and Engineering, Engineering, Geosciences, Mathematics and Physical Sciences, Social, Behavioral and Economic Sciences, and Education and Human Resources
- 200,000 scientists, engineers, educators and students at universities, laboratories and field sites
- 10,000 awards/year, 3 years duration (avg.)
5The NSF Nifty 50 (general)
- ACCELERATING, EXPANDING UNIVERSE
- ANTARCTIC OZONE HOLE RESEARCH
- ARABIDOPSIS: A PLANT GENOME PROJECT
- BAR CODES
- BLACK HOLES CONFIRMED
- BUCKY BALLS
- COMPUTER VISUALIZATION TECHNIQUES
- DATA COMPRESSION TECHNOLOGY
- DISCOVERY OF PLANETS
- DOPPLER RADAR
- EFFECTS OF ACID RAIN
- EL NIÑO AND LA NIÑA PREDICTIONS
- FIBER OPTICS
- GEMINI TELESCOPES
- HANTAVIRUS IDENTIFICATION
- DNA FINGERPRINTING
- MRI: MAGNETIC RESONANCE IMAGING
- NANOTECHNOLOGY
- THE NATIONAL OBSERVATORIES
- OVERCOMING HEAVY METALS
- OVERCOMING SALT TOXICITY
- TISSUE ENGINEERING
- TUMOR DETECTION
- VOLCANIC ERUPTION DETECTION
- YELLOW BARRELS
6Language-related Nifty 50
- AMERICAN SIGN LANGUAGE DICTIONARY DEVELOPMENT
- COMPUTER VISUALIZATION TECHNIQUES
- THE DARCI CARD
- DATA COMPRESSION TECHNOLOGY
- THE "EYE CHIP" OR RETINA CHIP
- THE INTERNET
- PERSONS WITH DISABILITIES ACCESS TO THE WEB
- PROJECT LISTEN
- SPEECH RECOGNITION TECHNOLOGY
- vBNS: VERY HIGH SPEED BACKBONE NETWORK SYSTEM
- WEB BROWSERS
7Browsing the Semantic Web
8Browsing the Semantic Web
9Desirable, not (yet) possible
- Word sense disambiguation
- Other types of queries (e.g. services)
- What is the cheapest available round-trip flight to Cancun the day after finals this semester?
- Set up an appointment with my optometrist for next week.
- List available 4-person BYU-approved apartments in Orem for under $150/month.
- Find me a linguistics job in Tahiti.
10Project TIDIE
- Apr 10, 2001 May 12, 2005
11Overview of TIDIE
- 3-year NSF project at BYU
- Total amount about $430,000
- PI: David Embley (BYU CS), 4 co-PIs from BYU
- 18 grad students, 45 publications
- Demos, tools, papers, presentations at website
(www.deg.byu.edu/)
12Goal of TIDIE
- Target-Based Independent-of-Document Information Extraction
- Target-based: user specifies what to find
- Not just keyword search, but concept-based search using an ontology
- Document independent
- Should work even if pages change over time, on new documents
- IE: match, merge, retrieve, format information
- Present in way that user can search, query results
13Document-based IE
14Recognition and extraction
15Concepts
- What drives the matching process for IE
- Inherent in words, numbers, phrases, text
- Linguistics: lexical semantics
- Denotations: entities, attributes
- Location: relationships
- Occurrences: constraints
16Concept matching
- We use exhaustive concept matching techniques to find concepts in documents, including:
- Lexical information (lexicons)
- Natural language processing (NLP) techniques
- Similarity of values
- Features of value
- Data frames
- Constraints
17Lexicons
- Repositories of enumerable classes of lexical information
- FirstNames, LastNames, USStates, ProvoOremApts, CarMakes, Drugs, CampGroundFeats, etc.
- WordNet (synonyms, word senses, hypernyms/hyponyms)
18The data-frame library
- Snippets of real-world knowledge about data (type, length, nearby keywords, patterns as in regexps, functional relations, etc.)
- Low-level patterns implemented as regular expressions
- Match items such as email addresses, phone numbers, names, etc.
Mileage matches [8]
  constant { extract "\b[1-9]\d{0,2}k"; substitute "[kK]" -> "000"; },
           { extract "[1-9]\d{0,2}?,\d{3}";
             context "[^\$\d][1-9]\d{0,2}?,\d{3}[^\d]";
             substitute "," -> ""; },
           { extract "[1-9]\d{0,2}?,\d{3}";
             context "(mileage\:\s*)[^\$\d][1-9]\d{0,2}?,\d{3}[^\d]";
             substitute "," -> ""; },
           { extract "[1-9]\d{3,6}";
             context "[^\$\d][1-9]\d{3,6}\s*mi(\.|\b|les\b)"; },
           { extract "[1-9]\d{3,6}";
             context "(mileage\:\s*)[^\$\d][1-9]\d{3,6}\b"; };
  keyword "\bmiles\b", "\bmi\.", "\bmi\b", "\bmileage\b";
end;
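A minimal Python sketch of how extract/context/substitute rules like the Mileage ones could be applied. The rule list is an assumed reading of three of the slide's patterns (in particular, that the context excludes a leading dollar sign), and `find_mileage` is a hypothetical helper, not the project's implementation.

```python
import re

# Assumed reading of three Mileage rules: (extract, context, substitutions).
# The context pattern, when present, must match around the extracted value.
MILEAGE_RULES = [
    (r"\b[1-9]\d{0,2}k", None, [(r"[kK]", "000")]),
    (r"[1-9]\d{0,2}?,\d{3}", r"[^\$\d][1-9]\d{0,2}?,\d{3}[^\d]", [(",", "")]),
    (r"[1-9]\d{3,6}", r"[^\$\d][1-9]\d{3,6}\s*mi(\.|\b|les\b)", []),
]


def find_mileage(text):
    """Return normalized mileage candidates, ordered by rule, then position."""
    found = []
    for extract, context, substitutions in MILEAGE_RULES:
        # Search for the context if there is one, otherwise for the value itself.
        for outer in re.finditer(context or extract, text, re.IGNORECASE):
            inner = re.search(extract, outer.group(0), re.IGNORECASE)
            if inner is None:
                continue
            value = inner.group(0)
            # Normalize the raw string, e.g. "82K" -> "82000", "7,000" -> "7000".
            for pattern, replacement in substitutions:
                value = re.sub(pattern, replacement, value)
            found.append(value)
    return found
```

Under these assumed patterns, `find_mileage("Red, 5 spd, only 7,000 miles.")` returns `["7000"]`, and a dollar amount such as `"Asking $11,995 firm"` yields no candidate.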
19Isolated concepts are OK, but...
- We're also interested in the relations between concepts
- This is often best done graphically
- Ontology: arrangement of concepts that makes their relations and constraints explicit
- Conceptual modeling: field of CS / linguistics that deals with formalizing concepts, using such information
- BYU has its own well-known conceptual modeling framework (OSM)
20Conceptual modeling (OSM)
21Ontologies and IE
Source
Target
22Constant/keyword recognition
'97 CHEVY Cavalier, Red, 5 spd, only 7,000 miles.
Previous owner heart broken! Asking only
$11,995. 1415. JERRY SEINER MIDVALE, 566-3800
or 566-3888
Descriptor|String|Position(start/end)
Year|97|2|3
Make|CHEV|5|8
Make|CHEVY|5|9
Model|Cavalier|11|18
Feature|Red|21|23
Feature|5 spd|26|30
Mileage|7,000|38|42
KEYWORD(Mileage)|miles|44|48
Price|11,995|100|105
Mileage|11,995|100|105
PhoneNr|566-3800|136|143
PhoneNr|566-3888|148|155
23Database instance generator
Year|97|2|3
Make|CHEV|5|8
Make|CHEVY|5|9
Model|Cavalier|11|18
Feature|Red|21|23
Feature|5 spd|26|30
Mileage|7,000|38|42
KEYWORD(Mileage)|miles|44|48
Price|11,995|100|105
Mileage|11,995|100|105
PhoneNr|566-3800|136|143
PhoneNr|566-3888|148|155
insert into Car values(1001, "97", "CHEVY", "Cavalier", "7,000", "11,995", "566-3800")
insert into CarFeature values(1001, "Red")
insert into CarFeature values(1001, "5 spd")
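A rough Python sketch of how the recognizer output might be turned into the insert statements. The two selection heuristics (keyword proximity, then earliest-and-longest match) and the names `pick_single` and `generate_inserts` are assumptions for illustration; the actual generator may use different or additional heuristics.

```python
# (descriptor, string, start, end) rows copied from the recognizer output.
RECOGNIZED = [
    ("Year", "97", 2, 3), ("Make", "CHEV", 5, 8), ("Make", "CHEVY", 5, 9),
    ("Model", "Cavalier", 11, 18), ("Feature", "Red", 21, 23),
    ("Feature", "5 spd", 26, 30), ("Mileage", "7,000", 38, 42),
    ("KEYWORD(Mileage)", "miles", 44, 48), ("Price", "11,995", 100, 105),
    ("Mileage", "11,995", 100, 105), ("PhoneNr", "566-3800", 136, 143),
    ("PhoneNr", "566-3888", 148, 155),
]
# Attributes stored in the Car table, in column order (assumed schema).
SINGLE_VALUED = ["Year", "Make", "Model", "Mileage", "Price", "PhoneNr"]


def pick_single(name, rows):
    """Choose one constant for a single-valued attribute (assumed heuristics)."""
    candidates = [r for r in rows if r[0] == name]
    keywords = [r for r in rows if r[0] == f"KEYWORD({name})"]
    if keywords:
        # Keyword proximity: prefer the constant that starts closest to a keyword.
        return min(candidates,
                   key=lambda r: min(abs(r[2] - k[2]) for k in keywords))[1]
    # Otherwise take the earliest match, preferring the longest string there,
    # so "CHEVY" wins over the subsumed "CHEV".
    return min(candidates, key=lambda r: (r[2], -len(r[1])))[1]


def generate_inserts(car_id, rows):
    """Build one Car insert plus one CarFeature insert per recognized feature."""
    values = ", ".join(f'"{pick_single(name, rows)}"' for name in SINGLE_VALUED)
    statements = [f"insert into Car values({car_id}, {values})"]
    for descriptor, string, _start, _end in rows:
        if descriptor == "Feature":
            statements.append(
                f'insert into CarFeature values({car_id}, "{string}")')
    return statements
```

With this input, the `miles` keyword at position 44 pulls Mileage toward `7,000`, which leaves `11,995` to Price alone.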
24Car ads extraction ontology
25Car ads ontology (textual)
- Car [-> object];
- Car [0..1] has Year [1..*];
- Car [0..1] has Make [1..*];
- Car [0..1] has Model [1..*];
- Car [0..1] has Mileage [1..*];
- Car [0..*] has Feature [1..*];
- Car [0..1] has Price [1..*];
- PhoneNr [1..*] is for Car [0..*];
- PhoneNr [0..1] has Extension [1..*];
Year matches [4]
  constant { extract "\d{2}";
             context "([^\$\d]|^)[4-9]\d[^\d]";
             substitute "^" -> "19"; },
  ...
  ...
End;
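One way to read the participation constraints is as bounds on how many values of each attribute a single Car may have. The sketch below encodes only the Car-side bounds and checks a candidate record against them; the dictionary layout and the `violations` helper are hypothetical.

```python
# Car-side participation bounds (min, max); None stands for "*" (unbounded).
CAR_PARTICIPATION = {
    "Year": (0, 1), "Make": (0, 1), "Model": (0, 1), "Mileage": (0, 1),
    "Feature": (0, None), "Price": (0, 1), "PhoneNr": (0, None),
}


def violations(record):
    """List attributes whose number of values breaks the Car-side bounds."""
    problems = []
    for attribute, (low, high) in CAR_PARTICIPATION.items():
        count = len(record.get(attribute, []))
        if count < low or (high is not None and count > high):
            problems.append(attribute)
    return problems


# Raw recognizer output before any heuristics: two Makes and two Mileages.
raw = {
    "Year": ["97"], "Make": ["CHEV", "CHEVY"],
    "Feature": ["Red", "5 spd"], "Mileage": ["7,000", "11,995"],
}
print(violations(raw))  # ['Make', 'Mileage']
```

A check like this flags exactly the attributes where the extraction heuristics have to choose among competing constants.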
26A gene ontology
27A genealogy data model
28Finding jobs in linguistics
- Built ontology for linguistics jobs: what defines a linguistics job
- Data frames and lexicons: language names (www.ethnologue.com), subfields of linguistics (www.linguistlist.org), tools linguists use, programming languages, activities, responsibilities, country names
- Documents: 3500 web pages and emails to me
- Complete results reported in DLLS 2003
29Sample query
30Sample output
31Subfield expertise sought
32Technical skills sought
33Sample observations
- 270 don't have "linguist" (!)
- Computer/computational background required for almost 1/3 (1116)
- Noticeable amount of headhunting, particularly in Seattle, DC areas
- Often a job title is not even listed (!)
- Great need for ontologies related to linguistics
- job titles
- theoretical frameworks, subfields
- typical linguist job activities
- linguistic research/development venues
34An engineering discipline?
- 160 linguistics jobs ending in "engineer"
- Software development cycle
- research e., software design e.
- development e., software e.
- software quality e., linguistic test e., linguistic quality e.
- linguistic support e., user experience e.
- presales e., technical sales e.
- Specific subfields
- web site e.
- speech e., voice recognition e., speech recognition application e., speech e., ASR tuning e., audio e.
- dialog e.
- tools e.
- AI e., NLP e.
- knowledge e., ontology e.
- linguist e., natural language e.
- staff e.
- human factors e., user interface e.
35A recent ontologist job ad
- Date: Thu, 28 Jul 2005 11:44:40
- Subject: General Linguistics Ontologist, Denver, USA
- Job Rank: Ontologist
- Specialty Areas: General Linguistics
- Position Summary: Ontologist
- This person will be responsible for modifying & editing Ontology structures.
- Skills
- Basic computer skills such as Internet, email, and spreadsheet programs
- In-depth knowledge of any major industry, such as Health Care, Automotive, Legal, Construction, and so forth helpful
- Superior communication skills, both oral and written. Ability to communicate effectively with reports, peers, superiors, and customers essential
- Travel &/or foreign language experience desired
- Personal Characteristics
- A healthy sense of logic, and a love for details
- A deep and abiding love of language, and of rule-governed classification systems. This person should be excited by the challenge of figuring out the precise place where a word belongs, and be delighted with the prospect of performing such tasks as the major part of their job
- Position Qualifications
- Bachelor's degree, preferably in Linguistics, Library Science, English, or related field
36Another recent ontologist ad
- Position Summary: Lead Ontologist
- The Lead Ontologist will be responsible for creating & designing Ontology and Ontology structures. This person will be responsible for innovation and general Ontology development as Ontology requirements change. They will serve as Team Lead on various Ontology projects, and they will assist the Director with certain aspects of management, including the development of department culture and standards. They will also serve as a liaison between the Director and the rest of the team.
- Skills
- Ability to edit & manipulate text highly desired, using tools such as Emacs and Perl. High level programming language experience and SQL also desired
- Knowledge of Ontology structures, and experience with developing and maintaining such structures
- Ability to assist with Ontology development and use problem-solving skills to overcome obstacles
- Ability to QA own Ontology work, and work of others
- Ability to lead projects from set-up through to QA
- Leadership or management experience a plus
- Position Qualifications
- Bachelor's degree in Linguistics, Library Science, or related field
- 2-3 years experience in Ontology or related field
- Application Deadline: Open until filled.
37Matching request with ontology
- Tell me about cruises on San Francisco Bay. I'd like to know scheduled times, cost, and the duration of cruises on Friday of next week.
38Building a query
[Figure: query built from the request, with labels Friday, Oct. 29th, cost (?), duration (?), and Result]
39
StartTime | Price | Duration | Source
10:45 am, 12:00 pm, 1:15, 2:30, 4:00 | 20.00, 16.00, 12.00 | | 1
10:00 am, 10:45 am, 11:15 am, 12:00 pm, 12:30 pm, 1:15 pm, 1:45 pm, 2:30 pm, 3:00 pm, 3:45 pm, 4:15 pm, 5:00 pm | 17.00, 16.00, 12.00 | 1 Hour | 2
40Another example
- Service Request
- Match with Task Ontology
- Domain Ontology
- Process Ontology
- Complete, Negotiate, Finalize
I want to see a dermatologist next week; any day
would be ok for me, at 4:00 p.m. The
dermatologist must be within 20 miles from my
home and must accept my insurance.
41Service domain ontology
42?
43Relevant mini-ontology
44Ontologies: issues
- Most successful in data-rich, narrow-domain applications
- Ambiguities are problematic; context only partially eliminates them
- Incompleteness: implicit information
- Commonsense world pragmatics evasive
- Knowledge prerequisites are steep
- Major efforts in creation, maintenance
- Must be created by experts
- Experts are biased in knowledge, agreement needed
- Ontologies continually change: upkeep a massive task
45Ontologies: possible solutions
- Some automation is needed
- Current automatic generation of ontologies is not successful, because the ontologies are extracted from free-form, unstructured text.
- A more effective alternative is to extract ontologies from structured data on the web (tables, charts, etc.)
- TANGO project
- Part 1: Extract tables from the web
- Part 2: Define mini-ontologies from tables
- Part 3: Merge into growing domain ontology
46Project TANGO
47Overview
- Table ANalysis for Generating Ontologies
- 3-year NSF-funded project
- Joint BYU/RPI project
- Uses and extends TIDIE concepts, ontologies
- Goal is to process tables, generate ontologies,
use results for IE
48Motivation
- Keyword or link analysis search not enough to search for information in tables
- Structure in tables can lead to domain knowledge which includes concepts, relationships and constraints (ontologies)
- Tables on web created for human use can lead to robust domain ontologies
49Table understanding
- What is a table?
- Why table normalization?
- What is table understanding?
- What is mini-ontology generation?
50What is a table?
- "a two-dimensional assembly of cells used to present information" (Lopresti and Nagy)
- Normalized tables (row-column format)
- Small paper (using OCR) and/or electronic tables (marked up) intended for human use
51?
Olympus C-750 Ultra Zoom
Sensor Resolution | 4.2 megapixels
Optical Zoom | 10 x
Digital Zoom | 4 x
Installed Memory | 16 MB
Lens Aperture | F/8-2.8/3.7
Focal Length min | 6.3 mm
Focal Length max | 63.0 mm
55Digital Camera
Olympus C-750 Ultra Zoom
Sensor Resolution | 4.2 megapixels
Optical Zoom | 10 x
Digital Zoom | 4 x
Installed Memory | 16 MB
Lens Aperture | F/8-2.8/3.7
Focal Length min | 6.3 mm
Focal Length max | 63.0 mm
56?
Flight | Class | From | Time/Date | To | Time/Date | Stops
Delta 16 | Coach | JFK | 6:05 pm, 02 01 04 | CDG | 7:35 am, 03 01 04 | 0
Delta 119 | Coach | CDG | 10:20 am, 09 01 04 | JFK | 1:00 pm, 09 01 04 | 0
58Airline Itinerary
Flight | Class | From | Time/Date | To | Time/Date | Stops
Delta 16 | Coach | JFK | 6:05 pm, 02 01 04 | CDG | 7:35 am, 03 01 04 | 0
Delta 119 | Coach | CDG | 10:20 am, 09 01 04 | JFK | 1:00 pm, 09 01 04 | 0
59?
Place | Bonnie Lake
County | Duchesne
State | Utah
Type | Lake
Elevation | 10,000 feet
USGS Quad | Mirror Lake
Latitude | 40.711ºN
Longitude | 110.876ºW
62Maps
Place | Bonnie Lake
County | Duchesne
State | Utah
Type | Lake
Elevation | 10,100 feet
USGS Quad | Mirror Lake
Latitude | 40.711ºN
Longitude | 110.876ºW
63Table normalization
Raw table
- take any table, produce a standard row-column
table with all data cells containing expanded
values and type information
Country | GDP/PPP | GDP/PPP Per Capita | Real-Growth Rate | Inflation
Afghanistan | 21,000,000,000 | 800 | ? | ?
Albania | 13,200,000,000 | 3,800 | 7.3 | 3.0
Algeria | 177,000,000,000 | 5,600 | 3.8 | 3.0
Andorra | 1,300,000,000 | 19,000 | 3.8 | 4.3
Angola | 13,300,000,000 | 1,330 | 5.4 | 110.0
Antigua and Barbuda | 674,000,000 | 10,000 | 3.5 | 0.4
Normalized table
64Normalizing across hyperlinks
65Normalized table
?? | Population | Population Growth rate | Population Density | Birth Rate | Death Rate | Migration Rate | Life Expectancy Male | Life Expectancy Female | Infant Mortality
Afghanistan | 25,824,882 | 3.95 | 39.88 persons/km2 | 4.19 | 1.70 | 1.46 | 47.82 years | 46.82 years | 14.06
Albania | 3,364,571 | 1.05 | 122.79 persons/km2 | 2.07 | 0.74 | -0.29 | 65.92 years | 72.33 years | 4.29
Algeria | 31,133,486 | 2.10 | 13.07 persons/km2 | 2.70 | 0.55 | -0.05 | 68.07 years | 70.46 years | 4.38
American Samoa | 63,786 | 2.64 | 320.53 persons/km2 | 2.65 | 0.40 | 0.39 | 71.23 years | 79.95 years | 1.02
Andorra | 65,939 | 2.24 | 146.53 persons/km2 | 1.03 | 0.55 | 1.76 | 80.55 years | 86.55 years | 0.41
Angola | 11,510 | 2.84 | 8.97 persons/km2 | 4.31 | 1.64 | 0.16 | 46.08 years | 50.82 years | 12.92
Western Sahara | 239,333 | 2.34 | 0.90 persons/km2 | 4.54 | 1.66 | -0.54 | 47.98 years | 50.57 years | 13.67
World | 5,995,544,836 | 1.30 | 14.42 persons/km2 | 2.20 | 0.90 | ? | 61.00 years | 65.00 years | 5.60
Yemen | 16,942,230 | 3.34 | 32.09 persons/km2 | 4.33 | 0.99 | 0.00 | 58.17 years | 61.88 years | 6.98
Zambia | 9,663,535 | 2.12 | 13.05 persons/km2 | 4.45 | 2.26 | 0.08 | 36.72 years | 37.21 years | 9.19
Zimbabwe | 11,163,160 | 1.02 | 28.87 persons/km2 | 3.06 | 2.04 | ? | 38.77 years | 38.94 years | 6.12
66How to understand tables
- Captions in vicinity of table (above, below, etc.)
- Footnotes on annotated column labels or data cells
- Embedded information in rows, columns or cells, e.g., $, %, (1,000), billions, etc.
- Links to other views of the table, possibly with new information
67Use of normalized data
- Take a table as an input and produce standard records in the form of attribute-value pairs as output
- Discover constraints among columns
- Understand the data values
<Country: Afghanistan>, <GDP/PPP: 21,000,000,000>, <GDP/PPP per capita: 800>, <Real-growth rate: ?>, <Inflation: ?>
has(Country, GDP/PPP), has(Country, GDP/PPP Per Capita), has(Country, Real-growth rate), has(Country, Inflation)
Left-most, primary key
Country | GDP/PPP | GDP/PPP Per Capita | Real-Growth Rate | Inflation
Afghanistan | 21,000,000,000 | 800 | ? | ?
Albania | 13,200,000,000 | 3,800 | 7.3 | 3.0
Algeria | 177,000,000,000 | 5,600 | 3.8 | 3.0
Andorra | 1,300,000,000 | 19,000 | 3.8 | 4.3
Angola | 13,300,000,000 | 1,330 | 5.4 | 110.0
Antigua and Barbuda | 674,000,000 | 10,000 | 3.5 | 0.4
Dollar amount (from data frame)
Percentage (from data frame)
Country names (from data frame)
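The step from a normalized table to attribute-value records and `has(...)` relationships can be sketched in a few lines of Python. The helper names below are hypothetical, and treating the left-most column as the key simply follows the slide's annotation.

```python
HEADER = ["Country", "GDP/PPP", "GDP/PPP Per Capita", "Real-Growth Rate", "Inflation"]
ROWS = [
    ["Afghanistan", "21,000,000,000", "800", "?", "?"],
    ["Albania", "13,200,000,000", "3,800", "7.3", "3.0"],
    ["Algeria", "177,000,000,000", "5,600", "3.8", "3.0"],
]


def to_records(header, rows):
    """Turn each row into a list of (attribute, value) pairs."""
    return [list(zip(header, row)) for row in rows]


def is_key(rows, column=0):
    """A column can serve as a key only if its values are all distinct."""
    values = [row[column] for row in rows]
    return len(set(values)) == len(values)


def has_relations(header, key_column=0):
    """Relate the key column to every other column."""
    key = header[key_column]
    return [f"has({key}, {name})"
            for index, name in enumerate(header) if index != key_column]


print(to_records(HEADER, ROWS)[0][:2])
# [('Country', 'Afghanistan'), ('GDP/PPP', '21,000,000,000')]
print(has_relations(HEADER)[0])  # has(Country, GDP/PPP)
```

Here `is_key(ROWS)` is true for the Country column, while the Inflation column fails the check because `3.0` repeats.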
68Ontology generation overview
69Example: Creating a domain ontology
[Figure: domain ontology diagram with labels: Longitude, Latitude, "Latitude and longitude designates location", Distances, Name, Geopolitical Entity, Location, "Includes procedural knowledge", has, names, Has GMT, Duration between Time zones, Time, Country, City, "Has associated data frames"]
70Example: Table understanding to mini-ontology generation
Agglomeration | Population | Continent | Country
Tokyo | 31,139,900 | Asia | Japan
New York-Philadelphia | 30,286,900 | The Americas | United States of America
Mexico | 21,233,900 | The Americas | Mexico
Seoul | 19,969,100 | Asia | Korea (South)
Sao Paulo | 18,847,400 | The Americas | Brazil
Jakarta | 17,891,000 | Asia | Indonesia
Osaka-Kobe-Kyoto | 17,621,500 | Asia | Japan
Niigata | 503,500 | Asia | Japan
Raurkela | 503,300 | Asia | India
Homjel | 502,200 | Europe | Belarus
Zunyi | 501,900 | Asia | China
Santiago | 501,800 | The Americas | Dominican Republic
Pingdingshan | 501,500 | Asia | China
Fargona | 501,000 | Asia | Uzbekistan
Kirov | 500,200 | Europe | Russia
Newcastle | 500,000 | Australia/Oceania | Australia
71Example: Concept matching to ontology merging
[Figure: two ontologies merged, with labels: Merge, Results, Has GMT]
72Ontology merging/growing
- Direct merge (no conflicts)
- Use results of matching phase to find similar concepts in ontologies (e.g., data value similarities, data frames, NLP, etc.)
- Conflict resolution
- Interactively identify evidence and counter evidence of functional relationships among mini-ontologies using constraint resolution
- IDS: Interaction with human knowledge engineer
- Issues: identify
- Default strategy: apply
- Suggestions: make
73Example Another mini-ontology generation
74Example Another mini-ontology generation
[Figure: mini-ontology merged into the domain ontology, with labels: Merge, Longitude, Latitude, Population, "Latitude and longitude designates location", Location, Name, Geopolitical Entity, has, names, has GMT, Time, City, Agglomeration, Country, Continent]
75Example Concept Mapping to Ontology Merging
[Figure: merged ontology diagram with labels: Longitude, Latitude, Population, "Latitude and longitude designates location", Location, Name, Geopolitical Entity, has, names, has GMT, Time, Geopolitical Entity with population, Elevation, USGS Quad, State, Place, Area, ?, Country, Lake, Agglomeration, Country, Continent, City/town, Mine, Reservoir]
76Recognize Table Information
 | | Religion | | | | |
Country | Population (July 2001 est.) | Albanian Orthodox | Muslim | Roman Catholic | Shia Muslim | Sunni Muslim | other
Afghanistan | 26,813,057 | | | | 15 | 84 | 1
Albania | 3,510,484 | 20 | 70 | 30 | | |
78Discover Mappings
79Merge
80Review the TANGO process
- Start out with normalized table
- Generate likely candidates for
- Object Sets
- Relationship Sets
- Functional Constraints
- Inclusion Constraints/Hierarchical Structure
- Get help from user when needed
- Choose best candidate for the ontology
81Generate concepts
Create list of candidate concepts (usually column
names)
82Example 1 Generate Concepts
Determine lexicalization (columns with associated
values are lexical)
83Example 1 Generate Concepts
Current ontology
84Example 1 Generate Relationships
- Decide relationship sets
- Exponential number of combinations
- Basic assumption: one main concept relates to all others (attributes)
- Goal: find central column of interest
85Example 1 Generate Relationships
Look for mapping between one column and title of
table
86Example 1 Generate Relationships
Current ontology
87Example 1 Generate Constraints
- FDs and Participation Constraints
- FD definition: $X \rightarrow Y$ iff $(X_i = X_j) \Rightarrow (Y_i = Y_j)$ for all row indexes $i$ and $j$.
- Unless solid case (two or more same values), only consider FDs from central object to attributes
- Use heuristics for setting exact participation (0:1, 1, etc.)
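The FD definition translates directly into a check over table rows: the dependency holds when no two rows agree on X but differ on Y. A minimal sketch, with a hypothetical `holds` helper and a few rows taken from the agglomeration table:

```python
def holds(rows, x, y):
    """True iff rows that agree on column x always agree on column y."""
    seen = {}
    for row in rows:
        # Remember the first y value seen for each x value; any later
        # disagreement is a counterexample to the dependency.
        if seen.setdefault(row[x], row[y]) != row[y]:
            return False
    return True


rows = [
    {"Agglomeration": "Tokyo", "Continent": "Asia", "Country": "Japan"},
    {"Agglomeration": "Osaka-Kobe-Kyoto", "Continent": "Asia", "Country": "Japan"},
    {"Agglomeration": "Seoul", "Continent": "Asia", "Country": "Korea (South)"},
    {"Agglomeration": "Sao Paulo", "Continent": "The Americas", "Country": "Brazil"},
]
print(holds(rows, "Country", "Continent"))  # True
print(holds(rows, "Continent", "Country"))  # False
```

Country to Continent is a "solid case" here because Japan repeats with the same continent. Agglomeration to Country also passes, but only trivially, since every agglomeration value is unique.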
88Example 1 Generate Concepts
Numerical values are usually functionally
determined by column of interest and have 0
participation constraint.
89Example 1 Generate Constraints
Completed mini-ontology
90Example 2 Generate Concepts
- SubFamily, Group, and SubGroup are generic types
- Enumerate column values as object sets because
less than 5 divisions (recursively)
91Example 2 Generate Relationships
- Found mapping of central column of interest to title (Language)
- Exceptions to basic assumption
- Hierarchy (enumerated object sets)
- Transitive FDs ($X \rightarrow Y$, $Y \rightarrow Z$, remove $X \rightarrow Z$)
- Create ISA hierarchy from table structure
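The transitive-FD rule can be sketched as a filter over a set of (determinant, dependent) pairs. The `remove_transitive` helper and the direction of the example dependencies are assumptions for illustration; only the column names come from the slides.

```python
def remove_transitive(fds):
    """Drop X -> Z whenever X -> Y and Y -> Z are both present."""
    fds = set(fds)
    redundant = {
        (x, z)
        for (x, y) in fds
        for (y2, z) in fds
        if y == y2 and x != z and (x, z) in fds
    }
    return fds - redundant


# Hypothetical dependencies among the Example 2 columns.
fds = {
    ("Language", "SubGroup"),
    ("SubGroup", "Group"),
    ("Language", "Group"),  # implied by the two above
}
print(sorted(remove_transitive(fds)))
# [('Language', 'SubGroup'), ('SubGroup', 'Group')]
```

This simple version assumes the dependencies contain no cycles; with cycles, a more careful minimal-cover computation would be needed.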
92Example 2 Generate Relationships
Current ontology
93Example 2 Generate Hierarchical Constraints
- Assign members to each object set for easy calculation
- Find inclusion dependencies
- Union: All members of parents are members of one or more child
- Intersection: (Less common) Child members are always in both parents
- Mutual exclusion: Intersection of any two child members is empty.
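Once members are assigned to each object set, the three hierarchical constraints reduce to set operations. A small sketch, with hypothetical helper names and made-up member sets:

```python
def is_union(parent, children):
    """Every member of the parent belongs to at least one child."""
    return parent <= set().union(*children)


def is_mutually_exclusive(children):
    """No member appears in more than one child."""
    seen = set()
    for child in children:
        if seen & child:
            return False
        seen |= child
    return True


def is_intersection(child, parents):
    """Every member of the child belongs to all of its parents."""
    return all(child <= parent for parent in parents)


# Made-up object-set members, for illustration only.
group = {"English", "German", "Swedish"}
subgroups = [{"English", "German"}, {"Swedish"}]
print(is_union(group, subgroups))        # True
print(is_mutually_exclusive(subgroups))  # True
```

When both the union and mutual-exclusion checks pass, the children partition the parent.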
94Example 2 Generate Hierarchical Constraints
Completed mini-ontology
95Future direction
- Start with multiple tables (or URLs) and generate mini-ontologies
- Identify most suitable mini-ontologies to merge by calculating which tables have most overlap of concepts
- Generate multiple domain ontologies
- Integrate with form-based data extraction tools (smarter Web search engines)
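One simple way to estimate "most overlap of concepts" is Jaccard similarity over concept names. This is a stand-in measure chosen for illustration, not one the slides specify, and exact name matching is a simplification of the concept matching described earlier. The concept sets below reuse column names from tables shown in this talk.

```python
from itertools import combinations


def jaccard(a, b):
    """Shared concepts divided by all concepts; 0.0 if both sets are empty."""
    return len(a & b) / len(a | b) if a | b else 0.0


def best_pair(mini_ontologies):
    """Return the names of the two mini-ontologies with the highest overlap."""
    return max(
        combinations(sorted(mini_ontologies), 2),
        key=lambda pair: jaccard(mini_ontologies[pair[0]],
                                 mini_ontologies[pair[1]]),
    )


mini_ontologies = {
    "agglomerations": {"Agglomeration", "Population", "Continent", "Country"},
    "gdp": {"Country", "GDP/PPP", "GDP/PPP Per Capita",
            "Real-Growth Rate", "Inflation"},
    "places": {"Place", "County", "State", "Type", "Elevation",
               "USGS Quad", "Latitude", "Longitude"},
}
print(best_pair(mini_ontologies))  # ('agglomerations', 'gdp')
```

The two tables that share the Country concept are selected first; the places table shares no concept names with either and would be merged later, if at all.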