Title: Automatic Creation and Simplified Querying of Semantic Web Content
1Automatic Creation and Simplified Querying of
Semantic Web Content
- An Approach Based on Information-Extraction
Ontologies
Yihong Ding, David W. Embley, and Stephen W. Liddle
Brigham Young University
2Fundamental Problems
- Lack of semantic web content
- Difficulty of content creation
- Inability to use semantic web content easily
3Proposed Solutions
- Automatically annotate data-rich web pages (turning them into semantic web pages)
- Provide for free-form, textual queries of semantic web content
4A Show-Case Vision
Find me the price and mileage of red Nissans. I want a 1990 or newer.
5Demo I: Data Extraction
6Demo II: Semantic Annotation
7Demo III: Free-Form Query
8Explanation: How it Works
- Extraction Ontologies
- Semantic Annotation
- Free-Form Query Interpretation
9Extraction Ontologies
- Object sets
- Relationship sets
- Participation constraints
- Lexical
- Non-lexical
- Primary object set
- Aggregation
- Generalization/Specialization
10Formalism: Extraction Ontologies
(a quick side note)
- Fully formalized in predicate calculus
- Object set: 1-place predicate
- N-ary relationship set: n-place predicate
- Constraint: closed predicate-calculus formula
- As a description logic: ALCN (Attributive Language with Complement and Number restrictions)
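To make the predicate-calculus reading concrete, here is a small worked example. The predicate names (Car, Price, CarHasPrice) and the constraint chosen (each car has at most one price) are illustrative assumptions, not taken from the authors' ontology.

```latex
\begin{align*}
&\text{Object sets (1-place predicates):} && \mathit{Car}(x),\ \mathit{Price}(y) \\
&\text{Relationship set (2-place predicate):} && \mathit{CarHasPrice}(x, y) \\
&\text{Constraint (closed formula):} && \forall x\,\forall y\,\forall z\,
  \bigl(\mathit{CarHasPrice}(x, y) \land \mathit{CarHasPrice}(x, z) \rightarrow y = z\bigr)
\end{align*}
```

The last formula has no free variables, which is what makes it a closed formula and therefore usable as a constraint.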
11Extraction Ontologies
Data Frame
Internal Representation: float
Values
External Rep.: \s\s(\d{1,3})(\.\d{2})?
Left Context
Key Word Phrase
Key Words: ([Pp]rice)|([Cc]ost)
Operators
Operator: gt
Key Words: (more\sthan)|(more\scostly)
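A minimal sketch of how a data frame of this kind could be modeled in Python. The class name `DataFrame`, its fields, and the exact regular expressions (including the dollar sign and the comma handling) are assumptions made for illustration; they are not the authors' implementation.

```python
import re
from dataclasses import dataclass, field


@dataclass
class DataFrame:
    """Hypothetical stand-in for an extraction-ontology data frame."""
    name: str
    value_pattern: str      # external representation (regex with one group)
    keyword_pattern: str    # context keywords (regex)
    operators: dict = field(default_factory=dict)  # operator -> keyword regex
    to_internal: type = float                      # internal representation

    def extract(self, text):
        """Return internal-form values, but only if a context keyword appears."""
        if not re.search(self.keyword_pattern, text):
            return []
        return [self.to_internal(m.group(1).replace(",", ""))
                for m in re.finditer(self.value_pattern, text)]

    def find_operators(self, text):
        """Return the names of operators whose keywords appear in the text."""
        return [op for op, pat in self.operators.items()
                if re.search(pat, text)]


# Assumed patterns for a Price data frame (illustrative only).
price = DataFrame(
    name="Price",
    value_pattern=r"\$\s*(\d+(?:,\d{3})*(?:\.\d{2})?)",
    keyword_pattern=r"[Pp]rice|[Cc]ost",
    operators={"gt": r"more\sthan|more\scostly"},
)

print(price.extract("Asking price: $4,500 or best offer"))
print(price.find_operators("anything that costs more than $3000"))
```

The point of the sketch is that the same declarative patterns serve two purposes: recognizing values in a web page and recognizing operator phrases in a query.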
12Data-Extraction Results Car Ads
Salt Lake Tribune

Attribute   Recall (%)   Precision (%)
Year        100          100
Make         97          100
Model        82          100
Mileage      90          100
Price       100          100
PhoneNr      94          100
Feature      91           99

Training set for tuning ontology: 100
Test set: 116
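For readers unfamiliar with the two measures, here is a short sketch of how recall and precision can be computed per attribute. The function name `recall_precision` and the sample values are made up for illustration and are not drawn from the experiment.

```python
def recall_precision(extracted, expected):
    """Compute recall and precision for one attribute.

    extracted: set of values the system produced
    expected:  set of values a human judged to be correct
    """
    correct = len(extracted & expected)
    # Recall: fraction of the expected values that were found.
    recall = correct / len(expected) if expected else 1.0
    # Precision: fraction of the extracted values that were right.
    precision = correct / len(extracted) if extracted else 1.0
    return recall, precision


# Made-up toy data: 4 expected models, 3 extracted, all 3 correct.
expected = {"Sentra", "Altima", "Maxima", "Town Car"}
extracted = {"Sentra", "Altima", "Maxima"}
recall, precision = recall_precision(extracted, expected)
print(f"recall={recall:.2f} precision={precision:.2f}")
```

A missed value lowers recall only, while a wrongly extracted value lowers precision only.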
13Car Ads Comments
- Dynamic sets
- Missed MERC, Town Car, 98 Royale
- Could use lexicon of makes and models
- Unspecified variation in lexical patterns
- Missed 5 speed (instead of 5 spd), p.l (instead of p.l.)
- Could adjust lexical patterns
- Misidentification of attributes
- Classified AUTO in AUTO SALES as automatic transmission
- Could adjust exceptions in lexical patterns
- Typographical errors
- Chrystler, DODG ENeon, I-15566-2441
- Could look for spelling variations and common
typos
14General Extraction Results
- 20 domains (cars, obituaries, cameras, jobs, games, prescription drugs, ...)
- Simple, unified domains: nearly 100% recall and precision
- Complex, loosely defined domains (e.g., obituaries: 82% recall and 74% precision)
- Typical: 80% recall and precision
15Generality and Resiliency of Extraction Ontologies
(another quick side note)
- Assumptions about web pages (generality)
- Data rich
- Narrow domain
- Document types
- Simple multiple-record documents (easiest)
- Single-record documents (harder)
- Records with scattered components (even harder)
- Declarative (resiliency)
- Still works when web pages change
- Works for new, unseen pages in the same domain
- Scalable, but takes work to declare the
extraction ontology
16Semantic Annotation
17Free-Form Query Interpretation
- Parse Free-Form Query (with data extraction ontology)
- Select Ontology
- Formulate Query Expression
- Run Query Over Semantically Annotated Data
18Parse Free-Form Query
Find me the [price] and [mileage] of all [red] [Nissan]s. I want a [1996] [or newer].
Recognized: price, mileage, red, Nissan, 1996; "or newer" is recognized as the gt operator.
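The parse step can be sketched as running an ontology's recognizers over the query text. The dictionaries below are a hypothetical, much-reduced car-ads ontology; every pattern and the function name `parse_query` are assumptions for illustration.

```python
import re

# Hypothetical recognizers for a toy car-ads ontology (all assumptions).
OBJECT_SET_KEYWORDS = {
    "Price": r"\bprice\b",
    "Mileage": r"\bmileage\b",
}
VALUE_PATTERNS = {
    "Color": r"\b(red|blue|green|black|white)\b",
    "Make": r"\b(Nissan|Ford|Toyota|Honda)s?\b",
    "Year": r"\b(19\d{2}|20\d{2})\b",
}
OPERATOR_KEYWORDS = {
    "gt": r"\bor\s+newer\b",
}


def parse_query(query):
    """Mark which object sets, values, and operators a query mentions."""
    object_sets = [name for name, pat in OBJECT_SET_KEYWORDS.items()
                   if re.search(pat, query, re.IGNORECASE)]
    values = {}
    for name, pat in VALUE_PATTERNS.items():
        match = re.search(pat, query, re.IGNORECASE)
        if match:
            values[name] = match.group(1)
    operators = [op for op, pat in OPERATOR_KEYWORDS.items()
                 if re.search(pat, query, re.IGNORECASE)]
    return {"object_sets": object_sets, "values": values,
            "operators": operators}


query = ("Find me the price and mileage of all red Nissans. "
         "I want a 1996 or newer.")
print(parse_query(query))
```

Everything the recognizers do not match ("Find me the", "of all", "I want a") is simply ignored, which is what lets the query stay free-form.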
19Select Ontology
Find me the price and mileage of all red Nissans. I want a 1996 or newer.
Similarity value 5
Similarity value 2
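One simple way to pick an ontology is to score each candidate by how many of its recognizers fire on the query and keep the highest score. The slides do not define how the similarity value is computed, so the counting rule, the two toy ontologies, and their patterns below are all assumptions.

```python
import re

# Two hypothetical ontologies, each reduced to a list of recognizer regexes.
ONTOLOGIES = {
    "car-ads": [r"\bprice\b", r"\bmileage\b", r"\bred\b",
                r"\bNissans?\b", r"\b(19|20)\d{2}\b"],
    "real-estate": [r"\bprice\b", r"\bbedrooms?\b", r"\bacres?\b",
                    r"\b(19|20)\d{2}\b"],
}


def similarity(query, patterns):
    """Count how many recognizers match somewhere in the query."""
    return sum(1 for pat in patterns
               if re.search(pat, query, re.IGNORECASE))


def select_ontology(query, ontologies):
    """Return the best-scoring ontology name and all scores."""
    scores = {name: similarity(query, pats)
              for name, pats in ontologies.items()}
    best = max(scores, key=scores.get)
    return best, scores


query = ("Find me the price and mileage of all red Nissans. "
         "I want a 1996 or newer.")
best, scores = select_ontology(query, ONTOLOGIES)
print(best, scores)
```

A real measure would likely weight matches differently; plain counting is used here only because it is the simplest rule that separates the two candidates.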
20Formulate Query Expression
- Conjunctive queries and aggregate queries
- Mentioned object sets are all of interest in the result.
- Values and operator keywords determine conditions.
- Color = red
- Make = Nissan
- Year gt 1996
gt Operator
21Formulate Query Expression
For
Let
Where
Return
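The For/Let/Where/Return skeleton can be filled in mechanically from the recognized object sets and conditions. The sketch below builds an XQuery-like string; the function name `formulate_query`, the element names, the `$r` variable, and the `annotated.xml` source are hypothetical, and the output is not claimed to match the authors' generated queries.

```python
def formulate_query(object_sets, conditions, record="Car",
                    source="annotated.xml"):
    """Build a For/Let/Where/Return style query string.

    object_sets: names to return, e.g. ["Price", "Mileage"]
    conditions:  (object set, operator, value) triples,
                 e.g. ("Year", "gt", 1996)
    """
    lines = [f'for $r in doc("{source}")//{record}']
    # One let-binding per object set of interest.
    lines += [f"let ${name.lower()} := $r/{name}" for name in object_sets]
    tests = []
    for name, op, value in conditions:
        literal = f'"{value}"' if isinstance(value, str) else str(value)
        tests.append(f"$r/{name} {op} {literal}")
    if tests:
        # Conjunctive query: all conditions are joined with "and".
        lines.append("where " + " and ".join(tests))
    returned = "".join(f"{{${name.lower()}}}" for name in object_sets)
    lines.append(f"return <result>{returned}</result>")
    return "\n".join(lines)


query_text = formulate_query(
    ["Price", "Mileage"],
    [("Color", "eq", "red"), ("Make", "eq", "Nissan"), ("Year", "gt", 1996)],
)
print(query_text)
```

Only conjunction is generated, which mirrors the restriction to conjunctive queries stated on the previous slide.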
22Run Query Over Semantically Annotated Data
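As a rough sketch of this step, the conjunctive conditions can be evaluated directly over annotated records. The records below are invented sample data, and representing annotations as Python dictionaries is an assumption made to keep the example self-contained.

```python
import operator

# Operator names mapped to comparison functions (names are assumptions).
OPS = {"eq": operator.eq, "gt": operator.gt, "lt": operator.lt}


def run_query(records, object_sets, conditions):
    """Return the requested object sets for records meeting all conditions."""
    results = []
    for rec in records:
        if all(name in rec and OPS[op](rec[name], value)
               for name, op, value in conditions):
            results.append({name: rec.get(name) for name in object_sets})
    return results


# Invented sample annotations, for illustration only.
records = [
    {"Make": "Nissan", "Color": "red", "Year": 1998,
     "Price": 4500.0, "Mileage": 88000},
    {"Make": "Nissan", "Color": "red", "Year": 1994,
     "Price": 2100.0, "Mileage": 131000},
    {"Make": "Ford", "Color": "red", "Year": 2001,
     "Price": 6900.0, "Mileage": 54000},
]
conditions = [("Color", "eq", "red"), ("Make", "eq", "Nissan"),
              ("Year", "gt", 1996)]
print(run_query(records, ["Price", "Mileage"], conditions))
```

Note that a strict `gt` excludes a 1996 car even though "1996 or newer" reads as inclusive; whether the authors' gt operator is inclusive is not stated in the slides.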
23Query Interpretation Results: Pilot Experiment with Car Ads
- 15 car-ads free-form queries from 3 volunteer CS students
- Results
- Recognizing object sets of interest
- Recall 85%
- Precision 90%
- Recognizing constraints
- Recall 61%
- Precision 79%
- Problems
- Regular expressions not tuned up and lexicons incomplete
- Ambiguities: "Are there any Ford mustangs, 2002, that are red?" (Is 2002 a year, mileage, or price?)
- Caveats
- No disjunction
- No negation
24General Query Interpretation Results
- AskOntos
- (Pilot experiment on 5 domains: cars, real estate, countries, movies, diamonds)
- Object sets of interest recognized
- Recall 90%
- Precision 90%
- Conditions recognized
- Recall 71%
- Precision 88%
25Pragmatics
All is not rosy
- Technical problems
- Extraction and query-interpretation accuracy
- Execution speed
- Harvesting
- Crawling?!
- Information behind forms on the hidden web
- Social problems
- Cooperation from web site developers
- End-user concerns
- Motivation
- Trust
26Conclusions
- Automatically create semantic-web content
- Do data extraction over an ordinary web page
- Create semantic-web page
- Cache page
- Store external semantic annotation with respect to an ontology
- Query semantic web pages
- Free-form queries
- Return results
- Table
- Link to original web page (scrolled and highlighted)
- Pragmatic considerations
www.deg.byu.edu