Title: Mining for Enhanced Web Search
1. Mining for Enhanced Web Search
- Ji-Rong Wen
- Media Management Group
- Microsoft Research Asia
2. Outline
- Log Based Query Clustering
- Log Based Query Expansion
- Mining Community from Network
- Enhancing Web IR Using Page Segmentation
3. Outline
- Log Based Query Clustering
- Log Based Query Expansion
- Mining Community from Network
- Enhancing Web IR Using Page Segmentation
4. Motivation
- Needs of a new generation of search engines
- Human editors want to find hot topics.
- Human indexers want to find hot query terms.
- Essentially a query clustering problem
- An identified cluster is viewed as a FAQ.
- Manual analysis of user queries is inefficient
- Build a clustering tool that automatically groups
similar queries from the user logs.
5. AskJeeves Demo (1)
6. AskJeeves Demo (2)
7. AskJeeves Demo (3)
8. Query Space vs. Document Space
[Diagram (not transcribed): analysis of user interactions connects query clustering in the query space with document clustering in the document space]
9. Our clustering principles
- Principle 1 (content based)
- Queries containing similar words are similar.
- Principle 2 (feedback based)
- Queries leading to common document clicks are
similar.
10. Similarity Functions
- Content-based similarity function
- Feedback-based similarity function
(The formulas were shown as images on the slide; a sketch follows below.)
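The similarity formulas on this slide were images and are not in the transcript. Below is a minimal Python sketch that follows the two clustering principles above: keyword overlap for the content-based measure and overlap of clicked documents for the feedback-based measure. The normalization (dividing by the larger set) is an assumption; the original formulas may weight terms differently.

```python
def content_similarity(query_p, query_q):
    """Content-based similarity: overlap of query keywords (Principle 1)."""
    words_p = set(query_p.lower().split())
    words_q = set(query_q.lower().split())
    if not words_p or not words_q:
        return 0.0
    return len(words_p & words_q) / max(len(words_p), len(words_q))


def feedback_similarity(clicks_p, clicks_q):
    """Feedback-based similarity: overlap of clicked documents (Principle 2)."""
    docs_p, docs_q = set(clicks_p), set(clicks_q)
    if not docs_p or not docs_q:
        return 0.0
    return len(docs_p & docs_q) / max(len(docs_p), len(docs_q))
```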
11. Weaknesses
- Content based approaches
- Word ambiguity
- Similar semantics, but (very) different syntax
- Short queries
- Feedback based approaches
- A clicked document is not necessarily relevant to the query.
- Documents usually contain more than one topic.
Query content or user feedback alone is not sufficient to cluster truly semantically related queries.
12. Combination of measures
- Query intentions are partially captured by both query contents and user feedback
- Linear combination:
  Sim(p, q) = a * Sim_content(p, q) + b * Sim_feedback(p, q)
- a and b are set manually (see the sketch below)
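A minimal sketch of the linear combination above, assuming the two component similarities are computed as in the previous sketch; the equal weights a = b = 0.5 mirror the KS-Simi setting used later in the evaluation and are only an illustrative choice, since the slide says the weights are set manually.

```python
def combined_similarity(sim_content, sim_feedback, a=0.5, b=0.5):
    """Sim(p, q) = a * Sim_content(p, q) + b * Sim_feedback(p, q)."""
    return a * sim_content + b * sim_feedback


# Two queries that share no keywords but share all clicked documents
# still get a non-zero score (cf. the example on the next slides).
print(combined_similarity(sim_content=0.0, sim_feedback=1.0))  # 0.5
```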
13. Example: query sessions
- Q1 <query text>: law of thermodynamics
  <clicked documents>: ID 761571911, Title: Thermodynamics; ID 761571262, Title: Conservation Laws
- Q2 <query text>: conservation laws
  <clicked documents>: ID 761571262, Title: Conservation Laws; ID 761571911, Title: Thermodynamics
- Q3 <query text>: Newton law
  <clicked documents>: ID 761573959, Title: Newton, Sir Isaac; ID 761573872, Title: Ballistics
- Q4 <query text>: Newton law
  <clicked documents>: ID 761556906, Title: Mechanics; ID 761556362, Title: Gravitation
14. Example: clustering results
- Content-based measure:
  - Cluster 1: law of thermodynamics (Q1)
  - Cluster 2: conservation laws (Q2)
  - Cluster 3: Newton law (Q3), Newton law (Q4)
- Feedback-based measure:
  - Cluster 1: law of thermodynamics (Q1), conservation laws (Q2)
  - Cluster 2: Newton law (Q3)
  - Cluster 3: Newton law (Q4)
- Content + feedback measure:
  - Cluster 1: law of thermodynamics (Q1), conservation laws (Q2)
  - Cluster 2: Newton law (Q3), Newton law (Q4)
Each single measure fails to cluster some related queries; the combined measure gives the expected result.
15. Query clustering process
16. Evaluation
- Comparison of four kinds of similarity functions
- K-Simi: keyword alone
- S-Simi: single document alone
- KS-Simi: keyword (0.5) + single document (0.5)
- KH-Simi: keyword (0.5) + document hierarchy (0.5)
- Data
- 20GB raw IIS logs from the Encarta website
- ... address, time, query, clicks, ...
- 2,772,615 query sessions are extracted
- 20,000 randomly selected sessions
- Parameter setting
- The minimal number of queries in a cluster (MinPts) is set to 3, which means only a cluster containing at least 3 queries is taken as a FAQ (a clustering sketch follows below).
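MinPts is DBSCAN-style terminology, so the sketch below assumes a density-based clustering over a precomputed query-query similarity matrix; the similarity threshold and the use of scikit-learn's DBSCAN are illustrative assumptions, not necessarily the exact algorithm behind the reported numbers.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_queries(similarity_matrix, sim_threshold=0.5, min_pts=3):
    """Density-based clustering of queries from a pairwise similarity matrix.

    Queries that do not fall into a cluster of at least `min_pts` members
    are labeled -1 (noise), so only clusters of 3+ queries become FAQs.
    """
    distance = 1.0 - np.asarray(similarity_matrix, dtype=float)  # similarity -> distance
    np.fill_diagonal(distance, 0.0)
    labels = DBSCAN(eps=1.0 - sim_threshold,
                    min_samples=min_pts,
                    metric="precomputed").fit_predict(distance)
    return labels
```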
17. Results
[Charts (not transcribed) comparing the precision, recall, and F-measure of the four similarity functions]
18. Conclusions
- User logs (user interactions) provide useful information to deduce user query intentions and to group similar queries.
- Combining content words and feedback information can provide higher performance in terms of precision and recall.
- This method has wide applications in many search engines.
19. Outline
- Log Based Query Clustering
- Log Based Query Expansion
- Mining Community from Network
- Enhancing Web IR Using Page Segmentation
20. Word mismatching
- The word-mismatching problem of Web searching
- Inconsistency of term usage between user queries and documents
- The Web is not well organized
- Users express queries with their own vocabularies
- Very short queries (less than two words)
- Simple (key)word matching doesn't work well
21. Big gap between the query space and the document space
- Query space vs. document space
- Query vector vs. document vector
- Cosine similarity between the query vector and the document vector (see the sketch below)
- Big gap: 73.68 degrees on average (cosine measure 0.28)
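As a quick sanity check on the figure above, a cosine similarity of about 0.28 does correspond to roughly a 73.7-degree angle; the small sketch below just converts between the two.

```python
import math

def angle_degrees(cosine_similarity):
    """Angle (in degrees) between two vectors with the given cosine similarity."""
    return math.degrees(math.acos(cosine_similarity))

print(round(angle_degrees(0.28), 2))  # 73.74 -- close to the reported 73.68-degree average gap
```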
22. Exploiting query logs
- Query log: a bridge connecting queries and documents
- Query session: <query text> + clicked documents
- Log-based query expansion
- Probabilistic correlations between query terms and document terms
- The correlations are then used to select high-quality expansion terms for new queries
23. Compared with local feedback and relevance feedback
24. Query sessions as a bridge
[Diagram (not transcribed): query sessions bridging the query space and the document space]
25. Correlations between query terms and document terms
[Diagram (not transcribed): weighted links between query-space and document-space terms, with example correlations 0.83, 0.89, 0.24, 0.67, 0.04, 0.17]
26. Term-term probabilistic correlations
- Term-term correlations are represented as the conditional probability of a document term given a query term (the formula was shown as an image on the slide).
27. Term-term probabilistic correlations (cont.)
- Estimation of the two conditional probabilities (formulas shown as images on the slide).
28. Query expansion based on term correlations
- For a new query, a scoring formula over the term correlations (shown as an image on the slide) is used to select candidate expansion terms.
- Top-ranked document terms are added to the original query to formulate a new one (see the sketch below).
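The correlation and selection formulas on slides 26-28 were images and are not in the transcript. The sketch below is one plausible reading of the pipeline: estimate the probability of a document term given a query term from co-occurrence counts over query sessions, then score candidate expansion terms for a new query by summing the correlations of its terms. Both the estimation and the additive combination are assumptions; the original formulas may normalize and weight differently.

```python
from collections import defaultdict

def estimate_correlations(sessions):
    """Estimate P(document term | query term) from query sessions.

    Each session is a pair (query_terms, clicked_doc_terms). Simple
    co-occurrence estimate: count how often a document term appears in the
    clicked documents of sessions containing the query term.
    """
    cooccur = defaultdict(lambda: defaultdict(float))
    query_term_count = defaultdict(float)
    for query_terms, doc_terms in sessions:
        for wq in set(query_terms):
            query_term_count[wq] += 1.0
            for wd in set(doc_terms):
                cooccur[wq][wd] += 1.0
    return {wq: {wd: c / query_term_count[wq] for wd, c in doc_counts.items()}
            for wq, doc_counts in cooccur.items()}


def select_expansion_terms(query_terms, correlations, top_k=10):
    """Score candidate document terms for a new query; return the top-ranked ones."""
    scores = defaultdict(float)
    for wq in query_terms:
        for wd, p in correlations.get(wq, {}).items():
            scores[wd] += p  # assumed additive combination over the query terms
    candidates = [(wd, s) for wd, s in scores.items() if wd not in query_terms]
    return sorted(candidates, key=lambda item: item[1], reverse=True)[:top_k]
```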
29. Characteristics of the log-based probabilistic query expansion
- Local technique in general
- Feasible in computation
- No initial retrieval required
- Reflects most users' intentions
- Evolves with the accumulation of user usage
30. Evaluation
- Data
- Two months of query logs (Oct 2000 - Dec 2000)
- 41,942 documents
- 30 evaluation queries (mostly short queries)
- Document relevance judged by human assessors
- Comparing our method with the baseline and Local Context Analysis (LCA)
31. Experiment I: retrieval effectiveness
Precision (%) at each recall level; improvements over the baseline (%) in parentheses.

Recall (%) | Baseline | LC Exp         | Log Exp
10         | 40.67    | 40.33 (-0.82)  | 62.00 (+52.46)
20         | 26.83    | 33.33 (+24.22) | 44.67 (+66.46)
30         | 21.56    | 27.00 (+25.26) | 37.00 (+71.65)
40         | 17.75    | 23.08 (+30.05) | 31.50 (+77.46)
50         | 15.07    | 20.40 (+35.40) | 27.67 (+83.63)
60         | 13.00    | 17.89 (+37.61) | 24.56 (+88.89)
70         | 11.43    | 16.29 (+42.50) | 22.24 (+94.58)
80         | 10.17    | 15.08 (+48.36) | 20.42 (+100.82)
90         | 9.44     | 13.96 (+47.84) | 18.89 (+100.00)
100        | 8.70     | 13.07 (+50.19) | 17.37 (+99.62)
Average    | 17.46    | 22.04 (+26.24) | 30.63 (+75.42)

- Improvement
- 75.42% over the baseline
- 38.95% over LCA
32. Experiment II: quality of expansion terms
- Examining 50 expansion terms obtained by the log-based method and LCA.

                   | LC Analysis (base) | Log-based | Improvement (%)
Relevant terms (%) | 23.27              | 30.73     | 32.03

- Example: Steve Jobs
- Expansion terms: Apple Computer, CEO, Macintosh, Microsoft, GUI, Personal Computers
33. Experiment III: impact of phrases
- Phrases are extracted from user logs.
- For TREC queries, phrases may not be as effective as expected.
- Not the case for short queries.
- Experiments show an 11.37% improvement on average when using phrases.
34. (No transcript)
35. Summary of evaluation
- The log-based query expansion produces significant improvements over both the baseline method and the LCA method.
- Query expansion is of great importance for short queries on the Web.
- Phrases can improve the performance of Web search.
36. Conclusions
- We show how big the gap is between the query space and the document space.
- A new log-based probabilistic query expansion method is proposed to bridge the gap.
- Experimental results show that our solution is effective, especially for short queries in Web searching.
- Log-mining-enhanced Web searching is a very promising direction.
37. Outline
- Log Based Query Clustering
- Log Based Query Expansion
- Mining Community from Network
- Enhancing Web IR Using Page Segmentation
38. Mining Knowledge and Structures from Networks
- Networks are everywhere
- Information networks
- Advice networks
- Human networks
- ...
- Sociometric representations
39. Virtual Community
- Virtual community: a concentric-circle model
- Discovering virtual communities and their evolution in a network environment
40. Discovering Virtual Communities
- Traditional clustering methods
- No order within a cluster
- An object can (usually) only belong to one cluster
- Need an accurate distance function
- Our method
- Uses an extended association-rule approach to discover authoritative clusters
- The core of a community is well represented by an authoritative cluster
- Expands the core gradually
41. The Algorithm
- Step 1: Finding candidate objects for cores
- Given an object set G and its link topology, build the adjacency matrix A_G
- Calculate hub and authority values for each object
- G' is the most authoritative subset of G (see the sketch below)
- Step 2: Mining frequent itemsets
- m-itemset: a combination of m objects
- As with association rules, build frequent 1- to m-itemsets in G' (m = 5 is sufficient and efficient)
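A minimal sketch of Step 1, assuming HITS-style hub and authority scores computed by power iteration over the adjacency matrix; taking a fixed top fraction of objects as the authoritative subset G' is an illustrative cutoff, since the slide does not say how "most authoritative" is thresholded.

```python
import numpy as np

def authoritative_subset(adjacency, top_fraction=0.1, iterations=50):
    """Compute hub/authority scores over A_G and return the most authoritative
    objects as candidate core members (G').

    adjacency[i, j] = 1 means object i links to object j; top_fraction is an
    assumed cutoff for illustration.
    """
    A = np.asarray(adjacency, dtype=float)
    n = A.shape[0]
    authority = np.ones(n)
    hub = np.ones(n)
    for _ in range(iterations):
        authority = A.T @ hub
        authority /= np.linalg.norm(authority) or 1.0
        hub = A @ authority
        hub /= np.linalg.norm(hub) or 1.0
    k = max(1, int(n * top_fraction))
    g_prime = np.argsort(authority)[::-1][:k]  # indices of the objects in G'
    return g_prime, authority, hub
```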
42. The Algorithm (cont.)
- Step 3: Constructing cores
- Merge similar itemsets (1st-pass merging)
- A super-itemset must be semi-frequent
- Step 4: Building complete clusters
- Expand cores with objects in G - G', according to in-links (see the sketch below)
- Merge similar clusters (2nd-pass merging)
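A minimal sketch of Step 4, assuming an outside object (from G - G') is pulled into a cluster when it links to enough members of the core; the in-link threshold is an illustrative assumption, as the slide does not give the exact expansion criterion.

```python
def expand_core(core, outside_objects, out_links, min_links_to_core=2):
    """Expand a community core with outside objects that point into it.

    core: set of object ids from G'; outside_objects: objects in G - G';
    out_links[o]: set of objects that o links to. min_links_to_core is an
    assumed rule for illustration.
    """
    cluster = set(core)
    for obj in outside_objects:
        if len(out_links.get(obj, set()) & set(core)) >= min_links_to_core:
            cluster.add(obj)
    return cluster
```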
43. Evolution of Communities
- If the link topology evolves, add a time axis to the clustering process
- Time-sequence analysis of communities
44. A Case Study: Mining CS Interest Groups
- Data
- Papers: proceedings papers from the ACM Digital Library (60,000 papers from 165 proceedings)
- People: CS researchers from 264 universities and labs (200,000 pages crawled, 22,000 classified as personal homepages)
- Automatically discover interest groups and their evolution in the computer science area
- Emergence and evolution of a research topic
- Career of a researcher
- Value of a paper
45. (No transcript)
46. Summary
- Features
- Discovers communities from a network automatically
- Each community is organized in a gradual structure
- Every object can be included in multiple communities
- Depends on topology only
- Reflects the evolution of communities
- Applications
- Our secret weapon to beat CiteSeer
- Consumer studies in e-commerce
- A fundamental technique for network analysis
47. Outline
- Log Based Query Clustering
- Log Based Query Expansion
- Mining Community from Network
- Enhancing Web IR Using Page Segmentation
48. Motivation
- Low quality of Web pages
- Noise: decoration, interaction, contact info
- Multiple topics, e.g. http://news.yahoo.com/
- Web page segmentation -> filtering out irrelevant information
- Traditional passage retrieval does not consider page structure
- The DOM tree is designed for presentation, not for content organization
49. Vision-Based Web Page Analysis
- The DOM tree is browse-oriented and doesn't directly reflect the semantic structure.
- The 2-D layout of a Web page reflects the organization pattern in the designer's mind.
- A new algorithm to re-engineer the page structure based on visual clues.
50. (No transcript)
51. Effect on Improving Web IR
- Comparison of several segmentation-based query expansion methods (average precision; improvement over the baseline in parentheses, %):

Number | Baseline | FULLDOC        | DOMPS | VIPS           | WINPS | COMBPS
3      | 16.55    | 17.56 (+6.10)  | 16.75 | 18.15 (+9.67)  | 17.03 | 17.45
5      | 16.55    | 17.46 (+5.50)  | 16.76 | 19.62 (+18.55) | 18.62 | 17.30
10     | 16.55    | 19.10 (+15.41) | 16.67 | 19.81 (+19.70) | 18.53 | 18.99
20     | 16.55    | 17.89 (+8.10)  | 17.71 | 20.66 (+24.83) | 20.38 | 20.64
30     | 16.55    | 17.40 (+5.14)  | 18.20 | 20.47 (+23.69) | 19.85 | 21.35
40     | 16.55    | 15.50 (-6.34)  | 17.73 | 18.65 (+12.69) | 19.94 | 20.99
50     | 16.55    | 13.82 (-16.50) | 18.46 | 17.41 (+5.20)  | 19.23 | 19.97
52. Comparison Chart
By combining the vision-based and fixed-length window methods, we obtain the best retrieval performance (> 23% improvement) on the TREC9 and TREC10 Web data (a combination sketch follows below).
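The slide does not spell out how the two methods are combined, so the sketch below shows one plausible combination under stated assumptions: each vision-based (VIPS-style) segment is re-cut into fixed-length word windows, and those windows are used as the passages for retrieval and query expansion. The window size and overlap values are illustrative.

```python
def fixed_length_windows(words, window_size=200, overlap=100):
    """Split a token list into overlapping fixed-length word windows."""
    if not words:
        return []
    step = max(1, window_size - overlap)
    return [words[i:i + window_size] for i in range(0, len(words), step)]


def combined_passages(vision_segments, window_size=200, overlap=100):
    """Combine vision-based segmentation with fixed-length windows.

    vision_segments: list of segment texts produced by a VIPS-style
    segmenter (assumed input); each segment is re-cut into windows that
    serve as retrieval/expansion passages.
    """
    passages = []
    for segment in vision_segments:
        passages.extend(fixed_length_windows(segment.split(), window_size, overlap))
    return [" ".join(window) for window in passages]
```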
53. Precision at Different Recall Levels
Precision (%) at each recall level; improvements over the baseline (%) in parentheses.

Recall (%)     | Baseline | FULLDOC (10)   | DOMPS (80)     | VIPS (20)      | WINPS (20)     | COMBPS (30)
Rel_ret (3363) | 2049     | 2246           | 2333           | 2471           | 2391           | 2416
0              | 58.55    | 62.04 (+5.96)  | 57.94          | 58.69          | 60.53          | 60.48
10             | 37.09    | 40.59 (+9.44)  | 39.96          | 39.99          | 42.95          | 43.09
20             | 28.13    | 30.43 (+8.18)  | 31.61          | 31.96          | 33.45          | 35.51
30             | 21.35    | 24.84 (+16.35) | 27.30          | 27.54          | 29.00          | 29.16
40             | 16.94    | 21.19 (+25.09) | 22.93          | 23.62          | 23.32          | 24.87
50             | 14.33    | 17.60 (+22.82) | 18.69          | 20.52          | 19.36          | 20.78
60             | 10.61    | 13.33 (+25.64) | 12.98          | 16.07          | 14.40          | 15.82
70             | 7.66     | 9.87 (+28.85)  | 9.39           | 11.53          | 10.82          | 12.62
80             | 5.96     | 6.65 (+11.58)  | 6.65           | 8.26           | 7.32           | 7.76
90             | 3.97     | 3.85 (-3.02)   | 3.59           | 4.65           | 4.35           | 4.67
100            | 1.95     | 0.96 (-50.77)  | 1.06           | 1.86           | 1.79           | 1.98
Avg.           | 16.55    | 19.10 (+15.41) | 19.22 (+16.13) | 20.66 (+24.83) | 20.38 (+23.14) | 21.35 (+29.00)
54. Mining for Enhanced Web Search
- Words cited from the WISE'02 Mining for Enhanced Web Search workshop:
- "We are currently drowning in data oceans and facing serious data overload... We foresee that the biggest challenge in the next several decades is how to effectively and efficiently dig out a machine-understandable information and knowledge layer from unorganized Web data... Since the Web is huge, heterogeneous and dynamic, automated Web information and knowledge discovery calls for novel technologies and tools... The mined information and knowledge will greatly improve the effectiveness of current web searching and enable much more sophisticated web information retrieval technologies in the future."
55. (No transcript)