Title: Investigating the Semantic Gap
1. Investigating the Semantic Gap
Peter Mika (Yahoo! Research)
Joint work with Edgar Meij (University of Amsterdam) and Hugo Zaragoza (Yahoo! Research)
2. Lots of data or very little?
[Figure: the Linked Data cloud (March 2009)]
3. Lots of data or very little?
[Figure: percentage of URLs with embedded metadata in various formats (September 2008 to March 2009)]
4. The Semantic Gap
- The real question is whether this data serves a purpose
- Our purpose: fulfilling the information needs of our users
- It's not about the size! Consider Wikipedia.
[Diagram: demand (information needs) vs. supply (information)]
5. A problem with multiple layers
- The Data Gap: does the data on the Semantic Web match the information needs of consumers?
  - Is the data actually there?
- The Vocabulary Gap: do the vocabularies/ontologies on the Semantic Web match the language of consumers?
  - Would we be able to understand it?
6. The Data Gap
7. Analysis of Semantic Web data through query logs
- Research questions
  - How much of this data would ever be encountered by a user through search?
  - What categories of queries can be answered?
  - What's the role of large sites?
- Method
  - Imitating the average search behavior of users through web search query log analysis
- Caveats
  - Search = document search
    - Assume current bag-of-words document retrieval is a reasonable approximation of semantic search
  - Search = web search
    - A particular data (set) may be of particular value to a particular user
8. SearchMonkey
[Screenshots: hCard and hReview examples]
9. Analysis
- Data
  - Microformats, eRDF, RDFa
- Query log data
  - US query log
  - Random sample of 7k queries
  - Recent query log covering over a month-long period
- Query classification data
  - US query log
  - 1000 queries classified into various categories
- Reproducible experiments (given the data)
  - BOSS web search API
    - Returns metadata for search result URLs in RDF/XML or DataRSS
  - Some programming :)
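The pipeline behind the impression counts on the next slides can be sketched as follows. This is a minimal illustration, not the study's actual code: `search_top10` is a hypothetical stand-in for the BOSS API call, which in practice returns each result URL annotated with the embedded-metadata formats detected there.

```python
from collections import Counter

def search_top10(query):
    """Hypothetical stand-in for the BOSS web search API: returns up to
    ten (url, formats) pairs, where formats lists the embedded-metadata
    formats (e.g. 'hCard', 'hReview', 'eRDF', 'RDFa') found at that URL."""
    raise NotImplementedError  # replaced by a real API call in practice

def count_impressions(queries, search=search_top10):
    """Count, per format, the 'impressions': (query, result) pairs whose
    result URL embeds metadata in that format."""
    impressions = Counter()
    for query in queries:
        for url, formats in search(query):
            for fmt in formats:
                impressions[fmt] += 1
    return impressions
```

Dividing each format's count by the number of sampled queries gives the "average impressions per query" figures reported on the following slides.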
10. Number of queries with a given number of results with particular formats (N=7081)
- Notes
  - Queries with 0 results with metadata are not shown
  - You cannot add up the columns: a query may return documents with different formats
  - Assumes queries return more than 10 results
[Chart: impressions and average impressions per query]
11. The influence of head sites (N=7081)
[Chart: impressions and average impressions per query]
12. By query category: local queries (N=129)
[Chart: impressions and average impressions per query]
13. Future work
- Usefulness
  - for display
  - for ranking
  - for disambiguation
- Analysis on the level of types
- Analysis on the level of properties
  - Which properties are most likely to contain the query terms?
- Evaluation of Linked Data
  - Will require human assessments
14. The Vocabulary Gap
15. Layer 2: analysis of vocabularies
- What are the aspects of (types of) objects that people are interested in?
- And how does that relate to what is actually modeled in ontologies?
- Observation: the same type of objects often share the same query context
  - Users are asking for the same aspect of the type
- Examples:
  - apple ipod nano review
  - sony plasma tv review
  - jerry yang biography
  - biography tim berners lee
  - tim berners lee blog
  - peter mika yahoo
  - britney spears shaves her head
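Decomposing such queries into a known entity plus its surrounding context (the "fix": a prefix and/or postfix) can be sketched as below. The entity dictionary and the longest-match heuristic are illustrative assumptions, not the talk's exact procedure.

```python
def split_query(query, entities):
    """If the query contains a known entity, return (prefix, entity,
    postfix); otherwise None. The longest matching entity wins, so
    'apple ipod nano' beats 'ipod' when both are in the dictionary."""
    q = query.lower()
    for entity in sorted(entities, key=len, reverse=True):
        idx = q.find(entity)
        if idx != -1:
            prefix = q[:idx].strip()
            postfix = q[idx + len(entity):].strip()
            return prefix, entity, postfix
    return None
```

For example, "apple ipod nano review" splits into the entity "apple ipod nano" and the postfix "review", while "biography tim berners lee" yields the prefix "biography".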
16. Models
- Desirable properties
  - P1: the fix is frequent within the type
  - P2: the fix has frequencies well-distributed across entities
  - P3: the fix is infrequent outside of the type
- Models
[Diagram: a query decomposed into entity and fix, with the entity mapped to a type; e.g. "apple ipod nano review" = entity "apple ipod nano" (type: product) + fix "review"]
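The slide lists the desiderata but not the model formulas themselves; one plausible way to combine P1-P3 into a single fix score per type (an illustrative sketch, not the talk's actual models M1-M6) is:

```python
from collections import Counter, defaultdict

def score_fixes(pairs, entity_type):
    """pairs: (entity, fix) observations from a query log;
    entity_type: maps each entity to its type.
    Returns {type: {fix: score}}, rewarding P1 (frequency within the
    type), P2 (spread across the type's entities), and P3 (rarity of
    the fix outside the type)."""
    by_type = defaultdict(Counter)     # P1: fix frequency within type
    ents_with_fix = defaultdict(set)   # (type, fix) -> entities using it
    ents_of_type = defaultdict(set)    # type -> all its entities
    global_freq = Counter()            # fix frequency over all types
    for entity, fix in pairs:
        t = entity_type[entity]
        by_type[t][fix] += 1
        ents_with_fix[(t, fix)].add(entity)
        ents_of_type[t].add(entity)
        global_freq[fix] += 1
    scores = {}
    for t, counts in by_type.items():
        scores[t] = {}
        for fix, c in counts.items():
            p1 = c                                                   # P1
            p2 = len(ents_with_fix[(t, fix)]) / len(ents_of_type[t]) # P2
            p3 = c / global_freq[fix]                                # P3
            scores[t][fix] = p1 * p2 * p3
    return scores
```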
17. Models (cont.)
18. Evaluation by query prediction
- Three days of UK query log for training, three days for testing
- Entity-based frequency as a baseline (gossip server)
- Measures
  - Recall at K, MRR (also per type)
- Variables
  - models (M1-M6)
  - number of fixes (1, 5, 10)
  - mapping (templates vs. categories)
  - type to use for a given entity
    - Random
    - Most frequent type
    - Best type
    - Combination
- To do: number of days of training
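The two reported measures are standard; a minimal implementation (my sketch, treating the true fix as the single relevant item per test query):

```python
def recall_at_k(ranked, relevant, k):
    """Fraction of test queries whose true fix appears in the top-k
    predictions. ranked: list of prediction lists; relevant: true fixes."""
    hits = sum(1 for preds, truth in zip(ranked, relevant) if truth in preds[:k])
    return hits / len(relevant)

def mrr(ranked, relevant):
    """Mean reciprocal rank of the true fix over all test queries
    (contributing 0 when the true fix is not predicted at all)."""
    total = 0.0
    for preds, truth in zip(ranked, relevant):
        if truth in preds:
            total += 1.0 / (preds.index(truth) + 1)
    return total / len(relevant)
```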
19. Most common fixes by entity
20. Results: query prediction success rate
21. Results: query prediction success rate (binned)
22. Semantic Search Assist tool
23. Qualitative assessment
- Five templates of varying sizes
  - settlements: 43,225
  - musical artists: 24,285
  - drugs: 2,321
  - football clubs: 628
  - information appliances: 82
- Top ten prefixes and postfixes using our model M5 and Wikipedia templates as classes
24. Observations
- Very few real factual questions where the answer would be a single number or literal
  - Or are they just underspecified queries?
- Some questions where the answer is a paragraph or section from the Wikipedia page, e.g. "aspirin overdose"
- A possible combination of document search, structured search and semantic search
- Automated analysis of the vocabulary gap?
25. Closing remarks
- A Semantic Web dashboard?
- Evaluation campaign for semantic search in Web IR?
  - Web search track?
- Feedback is welcome
  - pmika_at_yahoo-inc.com