Title: Characterizing Semantic Relatedness of Search Query Terms
1Characterizing Semantic Relatedness of Search Query Terms
Dominik Benz, Beate Krause, Praveen Kumar, Andreas Hotho, Gerd Stumme
Research Unit Knowledge and Data Engineering (KDE), University of Kassel, Germany
2Where do Semantics come from?
- Semantically annotated content is the fuel of the next-generation Semantic Web, but where is the petrol station?
- Derive semantics from how users interact with information!
  - Implicit interaction (search engine clicklogs)
  - Explicit interaction (collaborative tagging)
3Agenda
- Logsonomy Introduction
- Dataset
- Query Term Similarity Measures
- Semantic Grounding
- Summary and Outlook
4Explicit annotation: Folksonomies
- Folksonomies allow users to assign tags to resources.
5Implicit annotation: Search Engine clicklogs (Logsonomies)
- By clicking on results, users assign query terms to resources.
6Structural analogies between explicit and implicit annotation
- Implicit (Logsonomy): allow users to query terms and click the results
- Explicit (Folksonomy): allow users to assign tags to resources
- Formal model for both: $F = (U, T, R, Y)$ where
  - $U$, $T$, and $R$ are finite sets, whose elements are called users, tags and resources,
  - $Y \subseteq U \times T \times R$, called the set of tag assignments.
- Can also be seen as a ternary relation or a tripartite hypergraph
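A minimal sketch of the formal model above, assuming Python and made-up example data. Tag assignments are stored as a set of (user, tag, resource) triples, and U, T and R are derived from them. The names `Folksonomy` and `from_assignments` are illustrative and do not come from the slides.

```python
from typing import FrozenSet, Iterable, NamedTuple, Tuple

Triple = Tuple[str, str, str]  # (user, tag, resource)


class Folksonomy(NamedTuple):
    users: FrozenSet[str]
    tags: FrozenSet[str]
    resources: FrozenSet[str]
    assignments: FrozenSet[Triple]  # Y, a subset of U x T x R


def from_assignments(triples: Iterable[Triple]) -> Folksonomy:
    """Derive U, T and R from a collection of tag assignments Y."""
    y = frozenset(triples)
    return Folksonomy(
        users=frozenset(u for u, _, _ in y),
        tags=frozenset(t for _, t, _ in y),
        resources=frozenset(r for _, _, r in y),
        assignments=y,
    )


# Hypothetical example data (not taken from any real dataset).
F = from_assignments([
    ("u1", "java", "java.sun.com"),
    ("u1", "programming", "java.sun.com"),
    ("u2", "java", "javadev.de"),
])
```

The same structure serves for a logsonomy when the tag slot holds a query term and the resource slot holds a clicked URL.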
7Put together the pieces
- Analysis of similarity measures between Folksonomy Tags
  - able to extract synonyms, hypernyms
  - Cattuto 2008
- Similar Network properties of Folksonomies and Logsonomies
  - small world, clustering coefficient, cumulative strength, ..
  - Krause 2008
- Apply Similarity Measures to Logsonomy Graph!
9Logsonomy Dataset
- AOL clicklog (March 2006)
  - Users: search engine user IDs
  - Tags: retrieved by splitting queries (using whitespace)
  - Resources: clicked URLs
- Excerpt: 10,000 most often used query terms
  - |U| = 463,380, |T| = 10,000, |R| = 1,284,724
  - |Y| = 26,227,550
- Tag rank: position in most-popular list
  - 1: free
  - 2: county
  - 3: pictures
  - 4: school
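The construction described on this slide could be sketched as follows. The sketch assumes the click log is available as (user_id, query, clicked_url) records. The function name, the record layout, and counting term frequency once per click record are assumptions made for illustration; the example records are invented.

```python
from collections import Counter


def build_logsonomy(click_log, top_n=10_000):
    """Turn (user_id, query, clicked_url) records into (user, term, url) triples.

    Queries are split on whitespace and only the top_n most frequent
    terms are kept.
    """
    records = list(click_log)
    term_counts = Counter(
        term for _, query, _ in records for term in query.split()
    )
    kept = {term for term, _ in term_counts.most_common(top_n)}
    return {
        (user, term, url)
        for user, query, url in records
        for term in query.split()
        if term in kept
    }


# Hypothetical records; "county" is the least frequent term and is dropped.
log = [
    ("17", "free pictures", "example.org/a"),
    ("17", "free school pictures", "example.org/b"),
    ("42", "county school", "example.org/c"),
]
Y = build_logsonomy(log, top_n=3)
```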
11Similarity Measures: Co-occurrence and Tag Context
- Take co-occurrence frequency as similarity measure (coocc)
- Describe each tag as a context vector
  - each dimension of the vector space corresponds to another tag (TagContext)
  - compute similar tags by cosine similarity
Figure: context vector of the tag JAVA with the dimensions design, software, blog, web, programming
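A small sketch of both measures, under the assumption that two tags co-occur when the same user attaches them to the same resource. The slides do not spell out this definition, and the example triples are invented.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def cooccurrence(Y):
    """Count how often two tags share the same (user, resource) pair."""
    posts = defaultdict(set)
    for user, tag, resource in Y:
        posts[(user, resource)].add(tag)
    cooc = defaultdict(lambda: defaultdict(int))
    for tags in posts.values():
        for a, b in combinations(sorted(tags), 2):
            cooc[a][b] += 1
            cooc[b][a] += 1
    return cooc


def cosine(v, w):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(x * w.get(k, 0) for k, x in v.items())
    norm = sqrt(sum(x * x for x in v.values())) * sqrt(sum(x * x for x in w.values()))
    return dot / norm if norm else 0.0


def tag_context_similarity(cooc, a, b):
    """TagContext: compare the co-occurrence vectors of two tags."""
    return cosine(cooc[a], cooc[b])


# Hypothetical data: java and python never co-occur but share their context.
Y = {
    ("u1", "java", "r1"), ("u1", "programming", "r1"), ("u1", "software", "r1"),
    ("u2", "python", "r2"), ("u2", "programming", "r2"), ("u2", "software", "r2"),
}
cooc = cooccurrence(Y)
```

In this toy data the co-occurrence count between java and python is zero, while their TagContext similarity is maximal because both occur with exactly the same other tags.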
12Similarity Measures: User and Resource Context
- Two further possible context dimensions
  - Users (UserContext)
  - Resources (ResourceContext)
- (TF-IDF weighting showed no great effect)
Figure: UserContext vector of the tag JAVA with the dimensions John, Mary, Joe, Karl, Lucy
Figure: ResourceContext vector of the tag JAVA with the dimensions lwa.de, java.sun.com, javadev.de, google.com, hacking.com
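The two context dimensions can be illustrated with one helper that counts, per tag, how often it occurs with each user or each resource. This is a sketch under that assumption; the function names and the example triples are hypothetical.

```python
from collections import Counter, defaultdict
from math import sqrt


def context_vectors(Y, dimension):
    """Describe each tag by its usage counts per user or per resource."""
    index = {"user": 0, "resource": 2}[dimension]
    vectors = defaultdict(Counter)
    for triple in Y:
        vectors[triple[1]][triple[index]] += 1
    return vectors


def cosine(v, w):
    """Cosine similarity of two sparse count vectors."""
    dot = sum(x * w.get(k, 0) for k, x in v.items())
    norm = sqrt(sum(x * x for x in v.values())) * sqrt(sum(x * x for x in w.values()))
    return dot / norm if norm else 0.0


# Hypothetical data: the same users use java and python on different resources.
Y = [
    ("John", "java", "java.sun.com"),
    ("Mary", "java", "javadev.de"),
    ("John", "python", "hacking.com"),
    ("Mary", "python", "hacking.com"),
    ("Lucy", "guitar", "google.com"),
]
users = context_vectors(Y, "user")
resources = context_vectors(Y, "resource")
```

Here java and python look alike in the UserContext and unrelated in the ResourceContext, which shows how the choice of dimension changes what counts as similar.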
13Similarity Measures: FolkRank
- Take co-occurrence frequency as similarity measure (freq).
- Cosine similarity between tag vectors
- Use FolkRank to find related tags (folkrank).
  - Basic idea: PageRank-like spreading of weights through the folksonomy / logsonomy structure, with high weights for a particular tag in the random surfer vector
Figure: Web graph compared with the Logsonomy / Folksonomy graph
Andreas Hotho, Robert Jäschke, Christoph Schmitz, Gerd Stumme. Information Retrieval in Folksonomies: Search and Ranking. Proceedings of the 3rd European Semantic Web Conference, vol. 4011, pp. 411-426, Springer, Budva, Montenegro, 2006.
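The weight-spreading idea on this slide can be sketched roughly as follows. This is a simplified illustration and not the authors' implementation. The damping value, the fixed iteration count, putting the whole preference on one tag, and the use of a uniform-preference run as baseline are all assumptions. Subtracting a baseline run follows the differential idea of the cited FolkRank paper.

```python
from collections import defaultdict


def build_graph(Y):
    """Undirected weighted graph over users, tags and resources.

    Node names are prefixed so that the three sets stay disjoint.
    """
    graph = defaultdict(lambda: defaultdict(float))
    for user, tag, resource in Y:
        nodes = (("u", user), ("t", tag), ("r", resource))
        for a in nodes:
            for b in nodes:
                if a != b:
                    graph[a][b] += 1.0
    return graph


def weight_spreading(graph, preference, damping=0.7, iterations=50):
    """Iterate w <- damping * A w + (1 - damping) * p from a uniform start."""
    weights = {node: 1.0 / len(graph) for node in graph}
    for _ in range(iterations):
        spread = defaultdict(float)
        for node, neighbours in graph.items():
            total = sum(neighbours.values())
            for other, edge_weight in neighbours.items():
                spread[other] += weights[node] * edge_weight / total
        weights = {
            node: damping * spread[node] + (1 - damping) * preference.get(node, 0.0)
            for node in graph
        }
    return weights


def related_tags(Y, tag):
    """Rank other tags by the weight gained when the preference focuses on tag."""
    graph = build_graph(Y)
    uniform = {node: 1.0 / len(graph) for node in graph}
    baseline = weight_spreading(graph, uniform)
    focused = weight_spreading(graph, {("t", tag): 1.0})
    scores = {
        name: focused[(kind, name)] - baseline[(kind, name)]
        for kind, name in graph
        if kind == "t" and name != tag
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical data: guitar and music are disconnected from java.
Y = [
    ("u1", "java", "r1"), ("u1", "programming", "r1"),
    ("u2", "java", "r1"), ("u2", "programming", "r2"),
    ("u3", "guitar", "r3"), ("u3", "music", "r3"),
]
ranking = related_tags(Y, "java")
```

In this toy example `ranking` lists programming ahead of guitar and music, because no weight can flow from java into the disconnected part of the graph.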
14Example: Most related terms for guitar and brain
Table: most related terms, with one column each for BRAIN and GUITAR
15Qualitative Insights: Average Rank of related tags
Figure panels: Folksonomy, Logsonomy
16First insights
- Co-occurrence seems to have a similar bias towards high-frequency tags, i.e., possibly towards hypernyms.
- Tag Context (and partially ResourceContext) also seems to yield more synonyms and siblings.
- FolkRank: noisy
- User Context: mixed picture
- Now: grounding of these observations in WordNet.
18Semantic Grounding in WordNet
- WordNet is a large lexical database for English.
- Words with the same meaning are grouped in synsets, which are ordered by an is-a hierarchy.
- Introduction of a single artificial root node enables application of graph-based similarity metrics between pairs of nouns / pairs of verbs.
- Inclusion of top n del.icio.us tags in WordNet
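To illustrate why the artificial root matters, the following sketch uses a tiny hand-made is-a hierarchy instead of WordNet itself. The concepts and links in `HYPERNYMS` are hypothetical and only stand in for separate synset hierarchies.

```python
from collections import deque

# Hypothetical toy hierarchy (child -> parents); real WordNet is far larger.
HYPERNYMS = {
    "java": ["programming_language"],
    "python": ["programming_language"],
    "programming_language": ["language"],
    "language": ["communication"],
    "guitar": ["instrument"],
    "instrument": ["artifact"],
}


def with_root(hypernyms, root="ROOT"):
    """Attach every top-level concept to one artificial root node."""
    nodes = set(hypernyms) | {p for parents in hypernyms.values() for p in parents}
    rooted = {node: list(hypernyms.get(node, [])) for node in nodes}
    for node in nodes:
        if not rooted[node]:
            rooted[node] = [root]
    rooted[root] = []
    return rooted


def shortest_path_length(hypernyms, a, b):
    """Breadth-first search over is-a links, treated as undirected edges."""
    neighbours = {}
    for node, parents in hypernyms.items():
        neighbours.setdefault(node, set()).update(parents)
        for parent in parents:
            neighbours.setdefault(parent, set()).add(node)
    queue, seen = deque([(a, 0)]), {a}
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nxt in neighbours.get(node, set()) - seen:
            seen.add(nxt)
            queue.append((nxt, dist + 1))
    return None  # the two concepts are not connected


rooted = with_root(HYPERNYMS)
```

Without the root, `shortest_path_length(HYPERNYMS, "java", "guitar")` finds no path because the two concepts sit in separate hierarchies. With `rooted`, every pair of concepts is connected.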
19Example of Semantic Grounding
- Original tag: java
- Most similar tag
  - cooc, folkrank: programming
  - TagContext: python
Figure: tags (computers, programming, map, languages, design_patterns) mapped into the WordNet synset hierarchy; the grounded similarity is measured between java and python
20Shortest path between original tag and most closely related one
- Shortest path
- Jiang-Conrath distance: shown to be the semantically most adequate measure for similarity within WordNet (Budanitsky, Hirst, 2006).
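For reference, the Jiang-Conrath distance is commonly defined through the information content of the two concepts and of their lowest common subsumer. The notation below (IC, lcs, P) is the usual one from the literature and is not taken from the slides.

```latex
\[
  d_{\mathrm{JC}}(c_1, c_2)
    = \mathrm{IC}(c_1) + \mathrm{IC}(c_2)
      - 2\,\mathrm{IC}\bigl(\mathrm{lcs}(c_1, c_2)\bigr),
  \qquad
  \mathrm{IC}(c) = -\log P(c)
\]
```

Here $P(c)$ is the probability of encountering an instance of concept $c$ in a reference corpus, and $\mathrm{lcs}(c_1, c_2)$ is the most specific concept that subsumes both $c_1$ and $c_2$ in the is-a hierarchy.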
21Distribution of the lengths of shortest paths in WordNet
Figure panels: Folksonomy, Logsonomy
22Shortest path composition (length 1 and 2)
Figure panels: Folksonomy, Logsonomy; highlighted category: siblings
24Summary
- Similar network properties of folksonomies / logsonomies
- Application of term similarity measures to logsonomies
- Semantic Grounding of measures in WordNet
- Comparison of measure characteristics with folksonomies
- Conclusions
25Summary and Outlook
- Formalization into Logsonomies retains semantics inherent in log data
- Similarity measures from folksonomy analysis are also able to extract synonyms / hypernyms, but show partially different behaviour
  - Tag Context almost identical
  - Resource Context less precise
  - Cooccurrence influenced by logsonomy construction (restoring compounds, ..)
- Now possibly even more precise semantics by integrating Folksonomies / Logsonomies?
26Similar tags live on www.bibsonomy.org
Thanks for your attention!
Contact: benz_at_cs.uni-kassel.de
27Appendix: Music Genre Taxonomy learned from last.fm
28Level displacement in WordNet
Figure axis: level displacement to most related tag
29Qualitative insights: Overlap of 10 most related tags
Figure panels: Logsonomy / Folksonomy
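One plausible way to compute such an overlap is to count the tags shared by the first k entries of two rankings, for example from two similarity measures. Which rankings the slide compares is not recoverable from the transcript, so the function and the example lists below are illustrative assumptions.

```python
def top_k_overlap(ranking_a, ranking_b, k=10):
    """Number of tags shared by the first k entries of two rankings."""
    return len(set(ranking_a[:k]) & set(ranking_b[:k]))


# Hypothetical rankings of related tags for one query term.
by_cooccurrence = ["programming", "software", "web", "design"]
by_tag_context = ["python", "programming", "software", "blog"]
shared = top_k_overlap(by_cooccurrence, by_tag_context, k=3)
```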