Title: Investigating the Semantic Gap
1. Investigating the Semantic Gap
Peter Mika (Yahoo! Research)
Joint work with Edgar Meij (University of Amsterdam) and Hugo Zaragoza (Yahoo! Research)
2. Lots of data or very little?
[Figure: the Linked Data cloud (March 2009)]
3. Lots of data or very little?
[Figure: percentage of URLs with embedded metadata in various formats (September 2008 to March 2009)]
4. The Semantic Gap
- The real question is whether this data serves a purpose
- Our purpose: fulfilling the information needs of our users
- It's not about the size! Consider Wikipedia.
[Diagram: demand (information needs) vs. supply (information)]
5. A problem with multiple layers
- The Data Gap: does the data on the Semantic Web match the information needs of consumers?
  - Is the data actually there?
- The Vocabulary Gap: do the vocabularies/ontologies on the Semantic Web match the language of consumers?
  - Would we be able to understand it?
6. The Data Gap
7. Analysis of Semantic Web data through query logs
- Research questions
  - How much of this data would ever be encountered by a user through search?
  - What categories of queries can be answered?
  - What's the role of large sites?
- Method
  - Imitating the average search behavior of users through web search query log analysis
- Caveats
  - Search = document search
    - Assume current bag-of-words document retrieval is a reasonable approximation of semantic search
  - Search = web search
    - A particular data (set) may be of particular value to a particular user
8. SearchMonkey
[Screenshots: hCard and hReview examples]
9. Analysis
- Data
  - Microformats, eRDF, RDFa
- Query log data
  - US query log
  - Random sample of 7k queries
  - Recent query log covering over a month-long period
- Query classification data
  - US query log
  - 1000 queries classified into various categories
- Reproducible experiments (given the data)
  - BOSS web search API
    - Returns metadata for search result URLs in RDF/XML or DataRSS
  - Some programming :)
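The pipeline behind the impression counts on the next slides can be sketched as follows. This is a minimal illustration, not the study's actual code: `search_top10` is a hypothetical stand-in for the BOSS API call, which in practice returns each result URL annotated with the embedded-metadata formats detected there.

```python
from collections import Counter

def search_top10(query):
    """Hypothetical stand-in for the BOSS web search API: returns up to
    ten (url, formats) pairs, where formats lists the embedded-metadata
    formats (e.g. 'hCard', 'hReview', 'eRDF', 'RDFa') found at that URL."""
    raise NotImplementedError  # replaced by a real API call in practice

def count_impressions(queries, search=search_top10):
    """Count, per format, the 'impressions': (query, result) pairs whose
    result URL embeds metadata in that format."""
    impressions = Counter()
    for query in queries:
        for url, formats in search(query):
            for fmt in formats:
                impressions[fmt] += 1
    return impressions
```

Dividing each format's count by the number of sampled queries gives the "average impressions per query" figures reported on the following slides.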
10. Number of queries with a given number of results with particular formats (N=7081)
- Notes
  - Queries with 0 results with metadata are not shown
  - You cannot add up the columns: a query may return documents with different formats
  - Assumes queries return more than 10 results
[Chart: impressions and average impressions per query]
11. The influence of head sites (N=7081)
[Chart: impressions and average impressions per query]
12. By query category: local queries (N=129)
[Chart: impressions and average impressions per query]
13. Future work
- Usefulness
  - for display
  - for ranking
  - for disambiguation
- Analysis on the level of types
- Analysis on the level of properties
  - Which properties are most likely to contain the query terms?
- Evaluation of Linked Data
  - Will require human assessments
14. The Vocabulary Gap
15. Layer 2: analysis of vocabularies
- What are the aspects of (types of) objects that people are interested in?
- And how does that relate to what is actually modeled in ontologies?
- Observation: the same type of objects often share the same query context
  - Users are asking for the same aspect of the type
- Examples:
  - apple ipod nano review
  - sony plasma tv review
  - jerry yang biography
  - biography tim berners lee
  - tim berners lee blog
  - peter mika yahoo
  - britney spears shaves her head
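Decomposing such queries into a known entity plus its surrounding context (the "fix": a prefix and/or postfix) can be sketched as below. The entity dictionary and the longest-match heuristic are illustrative assumptions, not the talk's exact procedure.

```python
def split_query(query, entities):
    """If the query contains a known entity, return (prefix, entity,
    postfix); otherwise None. The longest matching entity wins, so
    'apple ipod nano' beats 'ipod' when both are in the dictionary."""
    q = query.lower()
    for entity in sorted(entities, key=len, reverse=True):
        idx = q.find(entity)
        if idx != -1:
            prefix = q[:idx].strip()
            postfix = q[idx + len(entity):].strip()
            return prefix, entity, postfix
    return None
```

For example, "apple ipod nano review" splits into the entity "apple ipod nano" and the postfix "review", while "biography tim berners lee" yields the prefix "biography".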
16. Models
- Desirable properties
  - P1: the fix is frequent within the type
  - P2: the fix has frequencies well-distributed across entities
  - P3: the fix is infrequent outside of the type
- Models
[Diagram: a query decomposed into entity and fix, with the entity mapped to a type; e.g. "apple ipod nano review" = entity "apple ipod nano" (type: product) + fix "review"]
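The slide lists the desiderata but not the model formulas themselves; one plausible way to combine P1-P3 into a single fix score per type (an illustrative sketch, not the talk's actual models M1-M6) is:

```python
from collections import Counter, defaultdict

def score_fixes(pairs, entity_type):
    """pairs: (entity, fix) observations from a query log;
    entity_type: maps each entity to its type.
    Returns {type: {fix: score}}, rewarding P1 (frequency within the
    type), P2 (spread across the type's entities), and P3 (rarity of
    the fix outside the type)."""
    by_type = defaultdict(Counter)     # P1: fix frequency within type
    ents_with_fix = defaultdict(set)   # (type, fix) -> entities using it
    ents_of_type = defaultdict(set)    # type -> all its entities
    global_freq = Counter()            # fix frequency over all types
    for entity, fix in pairs:
        t = entity_type[entity]
        by_type[t][fix] += 1
        ents_with_fix[(t, fix)].add(entity)
        ents_of_type[t].add(entity)
        global_freq[fix] += 1
    scores = {}
    for t, counts in by_type.items():
        scores[t] = {}
        for fix, c in counts.items():
            p1 = c                                                   # P1
            p2 = len(ents_with_fix[(t, fix)]) / len(ents_of_type[t]) # P2
            p3 = c / global_freq[fix]                                # P3
            scores[t][fix] = p1 * p2 * p3
    return scores
```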
17. Models (cont.)
18. Evaluation by query prediction
- Three days of UK query log for training, three days for testing
- Entity-based frequency as a baseline (gossip server)
- Measures
  - Recall at K, MRR (also per type)
- Variables
  - models (M1-M6)
  - number of fixes (1, 5, 10)
  - mapping (templates vs. categories)
  - type to use for a given entity
    - Random
    - Most frequent type
    - Best type
    - Combination
- To do: number of days of training
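The two reported measures are standard; a minimal implementation (my sketch, treating the true fix as the single relevant item per test query):

```python
def recall_at_k(ranked, relevant, k):
    """Fraction of test queries whose true fix appears in the top-k
    predictions. ranked: list of prediction lists; relevant: true fixes."""
    hits = sum(1 for preds, truth in zip(ranked, relevant) if truth in preds[:k])
    return hits / len(relevant)

def mrr(ranked, relevant):
    """Mean reciprocal rank of the true fix over all test queries
    (contributing 0 when the true fix is not predicted at all)."""
    total = 0.0
    for preds, truth in zip(ranked, relevant):
        if truth in preds:
            total += 1.0 / (preds.index(truth) + 1)
    return total / len(relevant)
```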
19. Most common fixes by entity
20. Results: query prediction success rate
21. Results: query prediction success rate (binned)
22. Semantic Search Assist tool
23. Qualitative assessment
- Five templates of varying sizes
  - settlements: 43,225
  - musical artists: 24,285
  - drugs: 2,321
  - football clubs: 628
  - information appliances: 82
- Top ten prefixes and postfixes using our model M5 and Wikipedia templates as classes
24. Observations
- Very few real factual questions where the answer would be a single number or literal
  - Or are they just underspecified queries?
- Some questions where the answer is a paragraph or section from the Wikipedia page, e.g. "aspirin overdose"
- A possible combination of document search, structured search and semantic search
- Automated analysis of the vocabulary gap?
25. Closing remarks
- A Semantic Web dashboard?
- Evaluation campaign for semantic search in Web IR?
  - Web search track?
- Feedback is welcome
  - pmika_at_yahoo-inc.com