1
Investigating the Semantic Gap
Peter Mika (Yahoo! Research)
Joint work with Edgar Meij (University of Amsterdam) and Hugo Zaragoza (Yahoo! Research)
2
Lots of data or very little?
Linked Data cloud (Mar, 2009)
3
Lots of data or very little?
Percentage of URLs with embedded metadata in
various formats (Sep, 2008-Mar, 2009)
4
The Semantic Gap
  • The real question is whether this data serves a purpose
  • Our purpose: fulfilling the information needs of our users
  • It's not about size! Consider Wikipedia.

(Diagram: demand = needs, supply = information)
5
A problem with multiple layers
  • The Data Gap: does the data on the Semantic Web match the information needs of consumers?
  • Is the data actually there?
  • The Vocabulary Gap: do the vocabularies/ontologies on the Semantic Web match the language of consumers?
  • Would we be able to understand it?

6
The Data Gap
7
Analysis of Semantic Web data through query logs
  • Research questions
  • How much of this data would ever be encountered
    by a user through search?
  • What categories of queries can be answered?
  • What's the role of large sites?
  • Method
  • Imitating the average search behavior of users
    through web search query log analysis
  • Caveats
  • Search ≠ document search
  • Assume current bag-of-words document retrieval is
    a reasonable approximation of semantic search
  • Search ≠ web search
  • A particular data (set) may be of particular
    value to a particular user

8
SearchMonkey
(Screenshots: SearchMonkey enhanced results built from hCard and hReview markup)
9
Analysis
  • Data
  • Microformats, eRDF, RDFa
  • Query log data
  • US query log
  • Random sample of 7k queries
  • Recent query log covering a period of over a month
  • Query classification data
  • US query log
  • 1000 queries classified into various categories
  • Reproducible experiments (given data)
  • BOSS web search API
  • Returns metadata for search result URLs in
    RDF/XML or DataRSS
  • Some programming :-)

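A minimal sketch of this measurement pipeline in Python, assuming a hypothetical search_top10 client in place of the real BOSS calls and DataRSS parsing (which are not reproduced here):

from collections import Counter

FORMATS = {"hCard", "hReview", "eRDF", "RDFa"}

def search_top10(query):
    # Hypothetical stand-in for a web search API (e.g. BOSS): returns
    # (url, formats) pairs for the top-10 results, where formats is
    # the set of metadata formats detected on that page.
    raise NotImplementedError

def impressions(queries):
    # For each sampled query, count how many of the top-10 results
    # carry each metadata format ("impressions" in this talk's terms).
    per_query = {}
    for q in queries:
        counts = Counter()
        for url, formats in search_top10(q):
            for f in formats & FORMATS:
                counts[f] += 1
        per_query[q] = counts
    return per_query

# Average impressions per query for one format, e.g. RDFa:
# sum(c["RDFa"] for c in per_query.values()) / len(per_query)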
10
Number of queries with a given number of results with particular formats (N=7081)
  • Note:
  • Queries with zero results with metadata are not shown
  • You cannot add up the columns: a query may return documents with different formats
  • Assumes queries return more than 10 results

(Chart: impressions and average impressions per query)
11
The influence of head sites (N=7081)
(Chart: impressions and average impressions per query)
12
By query category: local queries (N=129)
(Chart: impressions and average impressions per query)
13
Future work
  • Usefulness
  • for display
  • for ranking
  • for disambiguation
  • Analysis on the level of types
  • Analysis on the level of properties
  • Which properties are most likely to contain the
    query terms?
  • Evaluation of Linked Data
  • Will require human assessments

14
The Vocabulary Gap
15
Layer 2: analysis of vocabularies
  • What are the aspects of (types of) objects that
    people are interested in?
  • And how does that relate to what is actually
    modeled in ontologies?
  • Observation: objects of the same type often have the same query context
  • Users ask for the same aspect of the type

Example queries:
apple ipod nano review
sony plasma tv review
jerry yang biography
biography tim berners lee
tim berners lee blog
peter mika yahoo
britney spears shaves her head
16
Models
  • Desirable properties
  • P1: Fix is frequent within the type
  • P2: Fix has frequencies well-distributed across entities
  • P3: Fix is infrequent outside of the type
  • Models (one possible scoring is sketched below)

(Diagram: the query "apple ipod nano review" decomposes into the entity "apple ipod nano" of type product and the fix "review")
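The slide does not spell out models M1-M6, so the following is only one plausible scoring that rewards all three properties: raw frequency within the type (P1), entropy of the fix's distribution over entities (P2), and a penalty for occurrences outside the type (P3).

import math
from collections import Counter, defaultdict

def score_fixes(triples):
    # triples: (type, entity, fix) from decomposed queries.
    within = defaultdict(Counter)      # type -> fix counts (P1)
    by_entity = defaultdict(Counter)   # (type, fix) -> entity counts (P2)
    total = Counter()                  # fix counts over all types (P3)
    for t, e, f in triples:
        within[t][f] += 1
        by_entity[(t, f)][e] += 1
        total[f] += 1
    scores = {}
    for t, fixes in within.items():
        scores[t] = {}
        for f, n in fixes.items():
            counts = by_entity[(t, f)].values()
            entropy = -sum((c / n) * math.log(c / n) for c in counts)
            outside = total[f] - n     # occurrences in other types
            scores[t][f] = n * (1 + entropy) / (1 + outside)
    return scores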
17
Models cont.
18
Evaluation by query prediction
  • Three days of a UK query log for training, three days for testing
  • Entity-based frequency as baseline (gossip
    server)
  • Measures
  • Recall at K, MRR (also per type)
  • Variables
  • models (M1-M6)
  • number of fixes (1, 5, 10)
  • mapping (templates vs. categories)
  • type to use for a given entity
  • Random
  • Most frequent type
  • Best type
  • Combination
  • To do: number of days of training

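A sketch of the two measures, assuming a hypothetical predict(entity_type) function that returns a ranked list of fixes for the held-out query's entity type:

def evaluate(test_pairs, predict, k=10):
    # test_pairs: (entity_type, true_fix) from held-out queries.
    # predict(entity_type) must return a ranked list of fixes.
    hits, reciprocal_ranks, n = 0, 0.0, 0
    for etype, true_fix in test_pairs:
        ranked = predict(etype)
        n += 1
        if true_fix in ranked[:k]:
            hits += 1
        if true_fix in ranked:
            reciprocal_ranks += 1.0 / (ranked.index(true_fix) + 1)
    return hits / n, reciprocal_ranks / n   # recall@k, MRR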
19
Most common fixes by entity
20
Results: query prediction success rate
21
Results: query prediction success rate (binned)
22
Semantic Search Assist tool
23
Qualitative assessment
  • Five templates of varying sizes:
  • settlements (43,225)
  • musical artists (24,285)
  • drugs (2,321)
  • football clubs (628)
  • information appliances (82)

Top ten prefixes and postfixes using our model M5
and Wikipedia templates as classes
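The tally behind such a read-out is straightforward; a sketch, assuming (template, position, fix) observations taken from the decomposition step, where position marks whether the fix preceded or followed the entity:

from collections import Counter, defaultdict

def top_fixes(observations, k=10):
    # observations: (template, position, fix) triples, where position
    # is "prefix" or "postfix" relative to the entity.
    counts = defaultdict(Counter)
    for template, position, fix in observations:
        counts[(template, position)][fix] += 1
    return {key: [f for f, _ in c.most_common(k)]
            for key, c in counts.items()}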
24
Observations
  • Very few real factual questions where the answer
    would be a single number or literal
  • Or are they just underspecified queries?
  • Some questions where the answer is a paragraph or
    section from the Wikipedia page, e.g. aspirin
    overdose
  • Possible combination of document search,
    structured search and semantic search
  • Automated analysis of the vocabulary gap?

25
Closing remarks
  • A Semantic Web dashboard?
  • Evaluation campaign for semantic search in Web
    IR?
  • Web search track?
  • Feedback is welcome
  • pmika@yahoo-inc.com