Title: Mining for Enhanced Web Search
1. Mining for Enhanced Web Search
- Ji-Rong Wen
- Media Management Group
- Microsoft Research Asia
2. Outline
- Log Based Query Clustering
- Log Based Query Expansion
- Mining Community from Network
- Enhancing Web IR Using Page Segmentation
3. Outline
- Log Based Query Clustering
- Log Based Query Expansion
- Mining Community from Network
- Enhancing Web IR Using Page Segmentation
4. Motivation
- Needs of a new generation of search engines
- Human editors want to find hot topics.
- Human indexers want to find hot query terms.
- Essentially a query clustering problem
- An identified cluster is viewed as a FAQ.
- Manual analysis of user queries is inefficient
- Build a clustering tool that automatically groups
similar queries from the user logs.
5. AskJeeves Demo (1)
6. AskJeeves Demo (2)
7. AskJeeves Demo (3)
8. Query Space vs. Document Space
[Diagram (not transcribed): analysis of user interactions connects query clustering in the query space with document clustering in the document space]
9. Our clustering principles
- Principle 1 (content based)
- Queries containing similar words are similar.
- Principle 2 (feedback based)
- Queries leading to common document clicks are
similar.
10. Similarity Functions
- Content-based similarity function
- Feedback-based similarity function
(The formulas were shown as images on the slide; a sketch follows below.)
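The similarity formulas on this slide were images and are not in the transcript. Below is a minimal Python sketch that follows the two clustering principles above: keyword overlap for the content-based measure and overlap of clicked documents for the feedback-based measure. The normalization (dividing by the larger set) is an assumption; the original formulas may weight terms differently.

```python
def content_similarity(query_p, query_q):
    """Content-based similarity: overlap of query keywords (Principle 1)."""
    words_p = set(query_p.lower().split())
    words_q = set(query_q.lower().split())
    if not words_p or not words_q:
        return 0.0
    return len(words_p & words_q) / max(len(words_p), len(words_q))


def feedback_similarity(clicks_p, clicks_q):
    """Feedback-based similarity: overlap of clicked documents (Principle 2)."""
    docs_p, docs_q = set(clicks_p), set(clicks_q)
    if not docs_p or not docs_q:
        return 0.0
    return len(docs_p & docs_q) / max(len(docs_p), len(docs_q))
```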
11. Weaknesses
- Content based approaches
- Word ambiguity
- Similar semantics, but (very) different syntax
- Short queries
- Feedback based approaches
- A clicked document is not necessarily relevant to the query.
- Documents usually contain more than one topic.
Query content or user feedback alone is not sufficient to cluster truly semantically related queries.
12. Combination of measures
- Query intentions are partially captured by both query contents and user feedback
- Linear combination:
  Sim(p, q) = a * Sim_content(p, q) + b * Sim_feedback(p, q)
- a and b are set manually (see the sketch below)
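A minimal sketch of the linear combination above, assuming the two component similarities are computed as in the previous sketch; the equal weights a = b = 0.5 mirror the KS-Simi setting used later in the evaluation and are only an illustrative choice, since the slide says the weights are set manually.

```python
def combined_similarity(sim_content, sim_feedback, a=0.5, b=0.5):
    """Sim(p, q) = a * Sim_content(p, q) + b * Sim_feedback(p, q)."""
    return a * sim_content + b * sim_feedback


# Two queries that share no keywords but share all clicked documents
# still get a non-zero score (cf. the example on the next slides).
print(combined_similarity(sim_content=0.0, sim_feedback=1.0))  # 0.5
```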
13. Example: query sessions
- Q1 <query text>: law of thermodynamics
  <clicked documents>: ID 761571911, Title: Thermodynamics; ID 761571262, Title: Conservation Laws
- Q2 <query text>: conservation laws
  <clicked documents>: ID 761571262, Title: Conservation Laws; ID 761571911, Title: Thermodynamics
- Q3 <query text>: Newton law
  <clicked documents>: ID 761573959, Title: Newton, Sir Isaac; ID 761573872, Title: Ballistics
- Q4 <query text>: Newton law
  <clicked documents>: ID 761556906, Title: Mechanics; ID 761556362, Title: Gravitation
14. Example: clustering results
- Content-based measure:
  - Cluster 1: law of thermodynamics (Q1)
  - Cluster 2: conservation laws (Q2)
  - Cluster 3: Newton law (Q3), Newton law (Q4)
- Feedback-based measure:
  - Cluster 1: law of thermodynamics (Q1), conservation laws (Q2)
  - Cluster 2: Newton law (Q3)
  - Cluster 3: Newton law (Q4)
- Content + feedback measure:
  - Cluster 1: law of thermodynamics (Q1), conservation laws (Q2)
  - Cluster 2: Newton law (Q3), Newton law (Q4)
Each single measure fails to cluster some related queries; the combined measure gives the expected result.
15. Query clustering process
16. Evaluation
- Comparison of four kinds of similarity functions
- K-Simi: keyword alone
- S-Simi: single document alone
- KS-Simi: keyword (0.5) + single document (0.5)
- KH-Simi: keyword (0.5) + document hierarchy (0.5)
- Data
- 20GB raw IIS logs from the Encarta website
- ... address, time, query, clicks, ...
- 2,772,615 query sessions are extracted
- 20,000 randomly selected sessions
- Parameter setting
- The minimal number of queries in a cluster (MinPts) is set to 3, which means only a cluster containing at least 3 queries is taken as a FAQ (a clustering sketch follows below).
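MinPts is DBSCAN-style terminology, so the sketch below assumes a density-based clustering over a precomputed query-query similarity matrix; the similarity threshold and the use of scikit-learn's DBSCAN are illustrative assumptions, not necessarily the exact algorithm behind the reported numbers.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_queries(similarity_matrix, sim_threshold=0.5, min_pts=3):
    """Density-based clustering of queries from a pairwise similarity matrix.

    Queries that do not fall into a cluster of at least `min_pts` members
    are labeled -1 (noise), so only clusters of 3+ queries become FAQs.
    """
    distance = 1.0 - np.asarray(similarity_matrix, dtype=float)  # similarity -> distance
    np.fill_diagonal(distance, 0.0)
    labels = DBSCAN(eps=1.0 - sim_threshold,
                    min_samples=min_pts,
                    metric="precomputed").fit_predict(distance)
    return labels
```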
17. Results
[Charts (not transcribed) comparing the precision, recall, and F-measure of the four similarity functions]
18. Conclusions
- User logs (user interactions) provide useful information to deduce user query intentions and to group similar queries.
- Combining content words and feedback information can provide higher performance in terms of precision and recall.
- This method has wide applications in many search engines.
19. Outline
- Log Based Query Clustering
- Log Based Query Expansion
- Mining Community from Network
- Enhancing Web IR Using Page Segmentation
20. Word mismatching
- The word-mismatching problem of Web searching
- Inconsistency of term usage between user queries and documents
- The Web is not well organized
- Users express queries with their own vocabularies
- Very short queries (less than two words)
- Simple (key)word matching doesn't work well
21. Big gap between the query space and the document space
- Query space vs. document space
- Query vector vs. document vector
- Cosine similarity between the query vector and the document vector (see the sketch below)
- Big gap: 73.68 degrees on average (cosine measure 0.28)
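As a quick sanity check on the figure above, a cosine similarity of about 0.28 does correspond to roughly a 73.7-degree angle; the small sketch below just converts between the two.

```python
import math

def angle_degrees(cosine_similarity):
    """Angle (in degrees) between two vectors with the given cosine similarity."""
    return math.degrees(math.acos(cosine_similarity))

print(round(angle_degrees(0.28), 2))  # 73.74 -- close to the reported 73.68-degree average gap
```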
22. Exploiting query logs
- Query log: a bridge connecting queries and documents
- Query session: <query text> + clicked documents
- Log-based query expansion
- Probabilistic correlations between query terms and document terms
- The correlations are then used to select high-quality expansion terms for new queries
23. Compared with local feedback and relevance feedback
24. Query sessions as a bridge
[Diagram (not transcribed): query sessions bridging the query space and the document space]
25. Correlations between query terms and document terms
[Diagram (not transcribed): weighted links between query-space and document-space terms, with example correlations 0.83, 0.89, 0.24, 0.67, 0.04, 0.17]
26. Term-term probabilistic correlations
- Term-term correlations are represented as the conditional probability of a document term given a query term (the formula was shown as an image on the slide).
27. Term-term probabilistic correlations (cont.)
- Estimation of the two conditional probabilities (formulas shown as images on the slide).
28. Query expansion based on term correlations
- For a new query, a scoring formula over the term correlations (shown as an image on the slide) is used to select candidate expansion terms.
- Top-ranked document terms are added to the original query to formulate a new one (see the sketch below).
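The correlation and selection formulas on slides 26-28 were images and are not in the transcript. The sketch below is one plausible reading of the pipeline: estimate the probability of a document term given a query term from co-occurrence counts over query sessions, then score candidate expansion terms for a new query by summing the correlations of its terms. Both the estimation and the additive combination are assumptions; the original formulas may normalize and weight differently.

```python
from collections import defaultdict

def estimate_correlations(sessions):
    """Estimate P(document term | query term) from query sessions.

    Each session is a pair (query_terms, clicked_doc_terms). Simple
    co-occurrence estimate: count how often a document term appears in the
    clicked documents of sessions containing the query term.
    """
    cooccur = defaultdict(lambda: defaultdict(float))
    query_term_count = defaultdict(float)
    for query_terms, doc_terms in sessions:
        for wq in set(query_terms):
            query_term_count[wq] += 1.0
            for wd in set(doc_terms):
                cooccur[wq][wd] += 1.0
    return {wq: {wd: c / query_term_count[wq] for wd, c in doc_counts.items()}
            for wq, doc_counts in cooccur.items()}


def select_expansion_terms(query_terms, correlations, top_k=10):
    """Score candidate document terms for a new query; return the top-ranked ones."""
    scores = defaultdict(float)
    for wq in query_terms:
        for wd, p in correlations.get(wq, {}).items():
            scores[wd] += p  # assumed additive combination over the query terms
    candidates = [(wd, s) for wd, s in scores.items() if wd not in query_terms]
    return sorted(candidates, key=lambda item: item[1], reverse=True)[:top_k]
```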
29. Characteristics of the log-based probabilistic query expansion
- Local technique in general
- Feasible in computation
- No initial retrieval required
- Reflects most users' intentions
- Evolves with the accumulation of user usage
30. Evaluation
- Data
- Two months of query logs (Oct 2000 - Dec 2000)
- 41,942 documents
- 30 evaluation queries (mostly short queries)
- Document relevance judged by human assessors
- Comparing our method with the baseline and Local Context Analysis (LCA)
31. Experiment I: retrieval effectiveness
Precision (%) at each recall level; improvements over the baseline (%) in parentheses.

Recall (%) | Baseline | LC Exp         | Log Exp
10         | 40.67    | 40.33 (-0.82)  | 62.00 (+52.46)
20         | 26.83    | 33.33 (+24.22) | 44.67 (+66.46)
30         | 21.56    | 27.00 (+25.26) | 37.00 (+71.65)
40         | 17.75    | 23.08 (+30.05) | 31.50 (+77.46)
50         | 15.07    | 20.40 (+35.40) | 27.67 (+83.63)
60         | 13.00    | 17.89 (+37.61) | 24.56 (+88.89)
70         | 11.43    | 16.29 (+42.50) | 22.24 (+94.58)
80         | 10.17    | 15.08 (+48.36) | 20.42 (+100.82)
90         | 9.44     | 13.96 (+47.84) | 18.89 (+100.00)
100        | 8.70     | 13.07 (+50.19) | 17.37 (+99.62)
Average    | 17.46    | 22.04 (+26.24) | 30.63 (+75.42)

- Improvement
- 75.42% over the baseline
- 38.95% over LCA
32. Experiment II: quality of expansion terms
- Examining 50 expansion terms obtained by the log-based method and LCA.

                   | LC Analysis (base) | Log-based | Improvement (%)
Relevant terms (%) | 23.27              | 30.73     | 32.03

- Example: Steve Jobs
- Expansion terms: Apple Computer, CEO, Macintosh, Microsoft, GUI, Personal Computers
33. Experiment III: impact of phrases
- Phrases are extracted from user logs.
- For TREC queries, phrases may not be as effective as expected.
- Not the case for short queries.
- Experiments show an 11.37% improvement on average when using phrases.
34. (No transcript)
35. Summary of evaluation
- The log-based query expansion produces significant improvements over both the baseline method and the LCA method.
- Query expansion is of great importance for short queries on the Web.
- Phrases can improve the performance of Web search.
36. Conclusions
- We show how big the gap is between the query space and the document space.
- A new log-based probabilistic query expansion method is proposed to bridge the gap.
- Experimental results show that our solution is effective, especially for short queries in Web searching.
- Log-mining-enhanced Web searching is a very promising direction.
37. Outline
- Log Based Query Clustering
- Log Based Query Expansion
- Mining Community from Network
- Enhancing Web IR Using Page Segmentation
38. Mining Knowledge and Structures from Networks
- Networks are everywhere
- Information networks
- Advice networks
- Human networks
- ...
- Sociometric representations
39. Virtual Community
- Virtual community: a concentric-circle model
- Discovering virtual communities and their evolution in a network environment
40. Discovering Virtual Communities
- Traditional clustering methods
- No order within a cluster
- An object can (usually) only belong to one cluster
- Need an accurate distance function
- Our method
- Uses an extended association-rule approach to discover authoritative clusters
- The core of a community is well represented by an authoritative cluster
- Expands the core gradually
41. The Algorithm
- Step 1: Finding candidate objects for cores
- Given an object set G and its link topology, build the adjacency matrix A_G
- Calculate hub and authority values for each object
- G' is the most authoritative subset of G (see the sketch below)
- Step 2: Mining frequent itemsets
- m-itemset: a combination of m objects
- As with association rules, build frequent 1- to m-itemsets in G' (m = 5 is sufficient and efficient)
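A minimal sketch of Step 1, assuming HITS-style hub and authority scores computed by power iteration over the adjacency matrix; taking a fixed top fraction of objects as the authoritative subset G' is an illustrative cutoff, since the slide does not say how "most authoritative" is thresholded.

```python
import numpy as np

def authoritative_subset(adjacency, top_fraction=0.1, iterations=50):
    """Compute hub/authority scores over A_G and return the most authoritative
    objects as candidate core members (G').

    adjacency[i, j] = 1 means object i links to object j; top_fraction is an
    assumed cutoff for illustration.
    """
    A = np.asarray(adjacency, dtype=float)
    n = A.shape[0]
    authority = np.ones(n)
    hub = np.ones(n)
    for _ in range(iterations):
        authority = A.T @ hub
        authority /= np.linalg.norm(authority) or 1.0
        hub = A @ authority
        hub /= np.linalg.norm(hub) or 1.0
    k = max(1, int(n * top_fraction))
    g_prime = np.argsort(authority)[::-1][:k]  # indices of the objects in G'
    return g_prime, authority, hub
```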
42. The Algorithm (cont.)
- Step 3: Constructing cores
- Merge similar itemsets (1st-pass merging)
- A super-itemset must be semi-frequent
- Step 4: Building complete clusters
- Expand cores with objects in G - G', according to in-links (see the sketch below)
- Merge similar clusters (2nd-pass merging)
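A minimal sketch of Step 4, assuming an outside object (from G - G') is pulled into a cluster when it links to enough members of the core; the in-link threshold is an illustrative assumption, as the slide does not give the exact expansion criterion.

```python
def expand_core(core, outside_objects, out_links, min_links_to_core=2):
    """Expand a community core with outside objects that point into it.

    core: set of object ids from G'; outside_objects: objects in G - G';
    out_links[o]: set of objects that o links to. min_links_to_core is an
    assumed rule for illustration.
    """
    cluster = set(core)
    for obj in outside_objects:
        if len(out_links.get(obj, set()) & set(core)) >= min_links_to_core:
            cluster.add(obj)
    return cluster
```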
43. Evolution of Communities
- If the link topology evolves, add a time axis to the clustering process
- Time-sequence analysis of communities
44. A Case Study: Mining CS Interest Groups
- Data
- Papers: proceedings papers from the ACM Digital Library (60,000 papers from 165 proceedings)
- People: CS researchers from 264 universities and labs (200,000 pages crawled, 22,000 classified as personal homepages)
- Automatically discover interest groups and their evolution in the computer science area
- Emergence and evolution of a research topic
- Career of a researcher
- Value of a paper
45. (No transcript)
46. Summary
- Features
- Discovers communities from a network automatically
- Each community is organized in a gradual structure
- Every object can be included in multiple communities
- Depends on topology only
- Reflects the evolution of communities
- Applications
- Our secret weapon to beat CiteSeer
- Consumer studies in e-commerce
- A fundamental technique for network analysis
47. Outline
- Log Based Query Clustering
- Log Based Query Expansion
- Mining Community from Network
- Enhancing Web IR Using Page Segmentation
48. Motivation
- Low quality of Web pages
- Noise: decoration, interaction, contact info
- Multiple topics, e.g. http://news.yahoo.com/
- Web page segmentation -> filtering out irrelevant information
- Traditional passage retrieval does not consider page structure
- The DOM tree is designed for presentation, not for content organization
49. Vision-Based Web Page Analysis
- The DOM tree is browse-oriented and doesn't directly reflect the semantic structure.
- The 2-D layout of a Web page reflects the organization pattern in the designer's mind.
- A new algorithm to re-engineer the page structure based on visual clues.
50. (No transcript)
51. Effect on Improving Web IR
- Comparison of several segmentation-based query expansion methods (average precision; improvement over the baseline in parentheses, %):

Number | Baseline | FULLDOC        | DOMPS | VIPS           | WINPS | COMBPS
3      | 16.55    | 17.56 (+6.10)  | 16.75 | 18.15 (+9.67)  | 17.03 | 17.45
5      | 16.55    | 17.46 (+5.50)  | 16.76 | 19.62 (+18.55) | 18.62 | 17.30
10     | 16.55    | 19.10 (+15.41) | 16.67 | 19.81 (+19.70) | 18.53 | 18.99
20     | 16.55    | 17.89 (+8.10)  | 17.71 | 20.66 (+24.83) | 20.38 | 20.64
30     | 16.55    | 17.40 (+5.14)  | 18.20 | 20.47 (+23.69) | 19.85 | 21.35
40     | 16.55    | 15.50 (-6.34)  | 17.73 | 18.65 (+12.69) | 19.94 | 20.99
50     | 16.55    | 13.82 (-16.50) | 18.46 | 17.41 (+5.20)  | 19.23 | 19.97
52. Comparison Chart
By combining the vision-based and fixed-length window methods, we obtain the best retrieval performance (> 23% improvement) on the TREC9 and TREC10 Web data (a combination sketch follows below).
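The slide does not spell out how the two methods are combined, so the sketch below shows one plausible combination under stated assumptions: each vision-based (VIPS-style) segment is re-cut into fixed-length word windows, and those windows are used as the passages for retrieval and query expansion. The window size and overlap values are illustrative.

```python
def fixed_length_windows(words, window_size=200, overlap=100):
    """Split a token list into overlapping fixed-length word windows."""
    if not words:
        return []
    step = max(1, window_size - overlap)
    return [words[i:i + window_size] for i in range(0, len(words), step)]


def combined_passages(vision_segments, window_size=200, overlap=100):
    """Combine vision-based segmentation with fixed-length windows.

    vision_segments: list of segment texts produced by a VIPS-style
    segmenter (assumed input); each segment is re-cut into windows that
    serve as retrieval/expansion passages.
    """
    passages = []
    for segment in vision_segments:
        passages.extend(fixed_length_windows(segment.split(), window_size, overlap))
    return [" ".join(window) for window in passages]
```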
53. Precision at Different Recall Levels
Precision (%) at each recall level; improvements over the baseline (%) in parentheses.

Recall (%)     | Baseline | FULLDOC (10)   | DOMPS (80)     | VIPS (20)      | WINPS (20)     | COMBPS (30)
Rel_ret (3363) | 2049     | 2246           | 2333           | 2471           | 2391           | 2416
0              | 58.55    | 62.04 (+5.96)  | 57.94          | 58.69          | 60.53          | 60.48
10             | 37.09    | 40.59 (+9.44)  | 39.96          | 39.99          | 42.95          | 43.09
20             | 28.13    | 30.43 (+8.18)  | 31.61          | 31.96          | 33.45          | 35.51
30             | 21.35    | 24.84 (+16.35) | 27.30          | 27.54          | 29.00          | 29.16
40             | 16.94    | 21.19 (+25.09) | 22.93          | 23.62          | 23.32          | 24.87
50             | 14.33    | 17.60 (+22.82) | 18.69          | 20.52          | 19.36          | 20.78
60             | 10.61    | 13.33 (+25.64) | 12.98          | 16.07          | 14.40          | 15.82
70             | 7.66     | 9.87 (+28.85)  | 9.39           | 11.53          | 10.82          | 12.62
80             | 5.96     | 6.65 (+11.58)  | 6.65           | 8.26           | 7.32           | 7.76
90             | 3.97     | 3.85 (-3.02)   | 3.59           | 4.65           | 4.35           | 4.67
100            | 1.95     | 0.96 (-50.77)  | 1.06           | 1.86           | 1.79           | 1.98
Avg.           | 16.55    | 19.10 (+15.41) | 19.22 (+16.13) | 20.66 (+24.83) | 20.38 (+23.14) | 21.35 (+29.00)
54. Mining for Enhanced Web Search
- Words cited from the WISE'02 Mining for Enhanced Web Search workshop:
- "We are currently drowning in data oceans and facing serious data overload... We foresee that the biggest challenge in the next several decades is how to effectively and efficiently dig out a machine-understandable information and knowledge layer from unorganized Web data... Since the Web is huge, heterogeneous and dynamic, automated Web information and knowledge discovery calls for novel technologies and tools... The mined information and knowledge will greatly improve the effectiveness of current web searching and enable much more sophisticated web information retrieval technologies in the future."
55. (No transcript)