Mining for Enhanced Web Search (PowerPoint transcript)
1
Mining for Enhanced Web Search
  • Ji-Rong Wen
  • Media Management Group
  • Microsoft Research Asia

2
Outline
  • Log Based Query Clustering
  • Log Based Query Expansion
  • Mining Community from Network
  • Enhancing Web IR Using Page Segmentation

3
Outline
  • Log Based Query Clustering
  • Log Based Query Expansion
  • Mining Community from Network
  • Enhancing Web IR Using Page Segmentation

4
Motivation
  • Needs of a new generation of search engines
  • Human editors want to find hot topics.
  • Human indexers want to find hot query terms.
  • Essentially a query clustering problem
  • An identified cluster is viewed as a FAQ.
  • Manual analysis of user queries is inefficient
  • Build a clustering tool that automatically groups
    similar queries from the user logs.

5
AskJeeves Demo (1)
6
AskJeeves Demo (2)
7
AskJeeves Demo (3)
8
Query space vs. document space
(Diagram: the query space and the document space each support clustering; query clustering and document clustering are connected through analysis of user interactions.)
9
Our clustering principles
  • Principle 1 (content based)
  • Queries containing similar words are similar.
  • Principle 2 (feedback based)
  • Queries leading to common document clicks are
    similar.

10
Similarity functions
Content based
  • Keyword
  • Edit distance

Feedback based
  • Single document
  • Document hierarchy

11
Weakness
  • Content based approaches
  • Word ambiguity
  • Similar semantics, but (very) different syntax
  • Short queries
  • Feedback based approaches
  • A clicked document is not necessarily relevant to
    the query.
  • Documents usually contain more than one topic.

Query content or user feedback alone is not
sufficient to cluster truly semantic-related
queries.
12
Combination of measures
  • Query intentions are partially captured by both
    query contents and user feedback
  • Linear combination
  • Sim(p, q) = α · Sim_content(p, q) + β · Sim_feedback(p, q)
  • α and β are set manually
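As a minimal sketch of this linear combination (the Jaccard-style keyword and click-overlap similarities below are illustrative assumptions, not the paper's exact definitions):

```python
def keyword_similarity(p, q):
    """Content-based similarity: Jaccard overlap of query keywords."""
    wp, wq = set(p.lower().split()), set(q.lower().split())
    return len(wp & wq) / len(wp | wq) if wp | wq else 0.0

def feedback_similarity(clicks_p, clicks_q):
    """Feedback-based similarity: Jaccard overlap of clicked document IDs."""
    sp, sq = set(clicks_p), set(clicks_q)
    return len(sp & sq) / len(sp | sq) if sp | sq else 0.0

def combined_similarity(p, q, clicks_p, clicks_q, alpha=0.5, beta=0.5):
    """Sim(p, q) = alpha * Sim_content + beta * Sim_feedback; alpha, beta set manually."""
    return alpha * keyword_similarity(p, q) + beta * feedback_similarity(clicks_p, clicks_q)

# The two "Newton law" sessions (Q3/Q4) share text but no clicks:
print(combined_similarity("Newton law", "Newton law", [761573959], [761556906]))
```

With identical text and disjoint clicks, only the content half contributes, so the combined score is α · 1.0.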

13
Example - sessions
  • Q1 <query text>: law of thermodynamics
  • <clicked documents>: ID 761571911, Title: Thermodynamics
  • ID 761571262, Title: Conservation Laws
  • Q2 <query text>: conservation laws
  • <clicked documents>: ID 761571262, Title: Conservation Laws
  • ID 761571911, Title: Thermodynamics
  • Q3 <query text>: Newton law
  • <clicked documents>: ID 761573959, Title: Newton, Sir Isaac
  • ID 761573872, Title: Ballistics
  • Q4 <query text>: Newton law
  • <clicked documents>: ID 761556906, Title: Mechanics
  • ID 761556362, Title: Gravitation

14
Example - clustering results
  • Content-based measure (Q1 and Q2 fail to be clustered together)
  • Cluster 1: law of thermodynamics (Q1)
  • Cluster 2: conservation laws (Q2)
  • Cluster 3: Newton law (Q3), Newton law (Q4)
  • Feedback-based measure (Q3 and Q4 fail to be clustered together)
  • Cluster 1: law of thermodynamics (Q1), conservation laws (Q2)
  • Cluster 2: Newton law (Q3)
  • Cluster 3: Newton law (Q4)
  • Content + feedback measure (the expected result)
  • Cluster 1: law of thermodynamics (Q1), conservation laws (Q2)
  • Cluster 2: Newton law (Q3), Newton law (Q4)
15
Query clustering process
16
Evaluation
  • Comparison of four kinds of similarity functions
  • K-Simi: keyword alone
  • S-Simi: single document alone
  • KS-Simi: keyword (0.5) + single document (0.5)
  • KH-Simi: keyword (0.5) + document hierarchy (0.5)
  • Data
  • 20GB raw IIS logs from the Encarta website
  • ... address, time, query, clicks, ...
  • 2,772,615 query sessions are extracted
  • 20,000 randomly selected sessions
  • Parameter setting
  • The minimal number of queries in a cluster (MinPts) is set to 3, i.e., only a cluster containing at least 3 queries is taken as a FAQ.
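The MinPts parameter suggests a density-based, DBSCAN-style clustering; a minimal sketch under that assumption (the similarity function, threshold, and demo data below are illustrative, not the paper's):

```python
def dbscan(points, similarity, eps=0.5, min_pts=3):
    """Minimal DBSCAN-style clustering: two queries are neighbors when
    similarity >= eps; only clusters grown from a core point with at
    least min_pts neighbors survive (those clusters are kept as FAQs)."""
    labels = {}          # point index -> cluster id
    cluster_id = 0

    def neighbors(i):
        return [j for j in range(len(points)) if similarity(points[i], points[j]) >= eps]

    for i in range(len(points)):
        if i in labels:
            continue
        seed = neighbors(i)
        if len(seed) < min_pts:
            continue                      # not a core point: left unclustered
        cluster_id += 1
        labels[i] = cluster_id
        frontier = [j for j in seed if j not in labels]
        while frontier:
            j = frontier.pop()
            if j in labels:
                continue
            labels[j] = cluster_id
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:      # j is also a core point: keep expanding
                frontier.extend(k for k in nbrs if k not in labels)
    return labels

def _jaccard(p, q):
    sp, sq = set(p.split()), set(q.split())
    return len(sp & sq) / len(sp | sq)

# Three overlapping queries form a dense cluster; "z" stays unclustered.
clusters = dbscan(["a b", "a b c", "a b d", "z"], _jaccard, eps=0.5, min_pts=3)
```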

17
Results
(Charts: precision, recall, and F-measure for the four similarity functions.)
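For reference, the F-measure reported in these charts combines precision and recall; assuming the standard balanced F1 (harmonic mean):

```python
def f_measure(precision, recall):
    """Balanced F-measure: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# The harmonic mean rewards methods that balance precision and recall.
print(f_measure(0.8, 0.6))
```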
18
Conclusions
  • User logs (user interactions) provide useful
    information to deduce user query intentions and
    to group similar queries.
  • Combination of content words and feedback
    information can provide higher performance in
    terms of precision and recall.
  • This method has wide applications in many search engines.

19
Outline
  • Log Based Query Clustering
  • Log Based Query Expansion
  • Mining Community from Network
  • Enhancing Web IR Using Page Segmentation

20
Word mismatching
  • The word mismatching problem of web searching
  • Inconsistency of term usages between user queries
    and documents
  • The Web is not well-organized
  • Users express queries with their own vocabularies
  • Very short queries (less than two words)
  • Simple (key)word matching doesn't work well

21
Big gap between the query space and the document
space
  • Query space vs. document space.
  • Query vector vs. document vector.
  • Cosine similarity between query vector and
    document vector
  • Big gap
  • 73.68 degrees on average (cosine similarity ≈ 0.28)
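The angle between a query vector and a document vector follows directly from the cosine measure; a quick sketch (the toy term-weight vectors are illustrative assumptions, not the paper's data):

```python
import math

def angle_degrees(u, v):
    """Angle between two term-weight vectors, in degrees,
    recovered from their cosine similarity."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return math.degrees(math.acos(dot / norm))

# A short 2-term query against a document dominated by other terms:
query = [1, 1, 0, 0, 0, 0]
doc   = [1, 0, 3, 2, 4, 1]
print(round(angle_degrees(query, doc), 2))
```

Even one shared term leaves the two vectors nearly orthogonal, which is the gap the slide quantifies.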

22
Exploiting query logs
  • Query log: a bridge connecting queries and documents
  • Query session: <query text> + clicked documents
  • Log-based query expansion.
  • Probabilistic correlations between query terms
    and document terms
  • The correlations are then used to select high
    quality expansion terms for new queries

23
Compared with local feedback and relevance
feedback
24
Query sessions as a bridge
(Diagram: query sessions link the query space to the document space.)
25
Correlations between query terms and document terms
(Diagram: terms in the query space linked to terms in the document space with correlation weights, e.g., 0.83, 0.89, 0.24, 0.67, 0.04, 0.17.)
26
Term-term probabilistic correlations
  • Term-term correlations are represented as the conditional probability of a document term given a query term

27
Term-term probabilistic correlations (cont.)
  • Estimate of the two conditional probabilities.

28
Query expansion based on term correlations
  • For a new query, the following formula
  • is used to select candidate expansion terms.
  • Top ranked document terms are added into the
    original query to formulate a new one.
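The formulas on these slides are shown as images; the sketch below is a hedged reconstruction of the general idea (simple co-occurrence counting over query sessions, an assumption rather than the paper's exact estimate):

```python
from collections import Counter, defaultdict

def term_correlations(sessions):
    """Estimate P(document term | query term) from query sessions.
    Each session pairs the query's terms with the clicked documents' terms."""
    q_count = Counter()
    pair_count = defaultdict(Counter)
    for query_terms, doc_terms in sessions:
        for qt in query_terms:
            q_count[qt] += 1
            for dt in doc_terms:
                pair_count[qt][dt] += 1
    return {qt: {dt: c / q_count[qt] for dt, c in dts.items()}
            for qt, dts in pair_count.items()}

def expand(query_terms, correlations, top_k=3):
    """Rank candidate expansion terms by their summed correlation with the
    query terms; the top-ranked document terms are appended to the query."""
    scores = Counter()
    for qt in query_terms:
        for dt, p in correlations.get(qt, {}).items():
            if dt not in query_terms:
                scores[dt] += p
    return [dt for dt, _ in scores.most_common(top_k)]

# Toy sessions echoing the "Steve Jobs" example later in the talk:
sessions = [
    (["jobs"], ["apple", "ceo", "macintosh"]),
    (["jobs"], ["apple", "ceo"]),
    (["jobs"], ["apple", "gui"]),
]
corr = term_correlations(sessions)
print(expand(["jobs"], corr))
```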

29
Characteristic of the log-based probabilistic
query expansion
  • A local technique in general
  • Computationally feasible
  • No initial retrieval required
  • Reflects most users' intentions
  • Evolves as user logs accumulate

30
Evaluation
  • Data
  • Two months of query logs (Oct 2000 to Dec 2000)
  • 41,942 documents
  • 30 evaluation queries (mostly short queries)
  • Document relevance judged by human assessors.
  • Comparing our method with the baseline and the
    Local Context Analysis (LCA)

31
Experiment I retrieval effectiveness
Recall   Baseline   Local Context Exp.   Log-Based Exp.
10       40.67      40.33 (-0.82%)       62.00 (+52.46%)
20       26.83      33.33 (+24.22%)      44.67 (+66.46%)
30       21.56      27.00 (+25.26%)      37.00 (+71.65%)
40       17.75      23.08 (+30.05%)      31.50 (+77.46%)
50       15.07      20.40 (+35.40%)      27.67 (+83.63%)
60       13.00      17.89 (+37.61%)      24.56 (+88.89%)
70       11.43      16.29 (+42.50%)      22.24 (+94.58%)
80       10.17      15.08 (+48.36%)      20.42 (+100.82%)
90       9.44       13.96 (+47.84%)      18.89 (+100.00%)
100      8.70       13.07 (+50.19%)      17.37 (+99.62%)
Average  17.46      22.04 (+26.24%)      30.63 (+75.42%)
  • Improvement
  • +75.42% over the baseline
  • +38.95% over LCA

32
Experiment II quality of expansion terms
  • Examining 50 expansion terms obtained by the
    log-based method and LCA.

                     LC Analysis (base)   Log-Based   Improvement (%)
Relevant terms (%)   23.27                30.73       32.03
  • Example: "Steve Jobs" → Apple Computer, CEO, Macintosh, Microsoft, GUI, Personal Computers

33
Experiment III impact of phrases
  • Phrases are extracted from user logs.
  • For TREC queries, phrases may not be as effective as expected.
  • This is not the case for short queries.
  • Experiments show an 11.37% improvement on average when using phrases.

34
(No Transcript)
35
Summary of evaluation
  • The log-based query expansion produces significant improvements over both the baseline method and the LCA method.
  • Query expansion is of great importance for short
    queries on the Web.
  • Phrases can improve the performance of web search.

36
Conclusions
  • We show how big the gap between the query space and the document space is.
  • A new log-based probabilistic query expansion method is proposed to bridge the gap.
  • Experimental results show that our solution is effective, especially for short queries in Web searching.
  • Log mining enhanced web searching is a very
    promising direction.

37
Outline
  • Log Based Query Clustering
  • Log Based Query Expansion
  • Mining Community from Network
  • Enhancing Web IR Using Page Segmentation

38
Mining Knowledge and Structures from the Networks
  • Network everywhere
  • Information network
  • Advice network
  • Human network
  • Sociometric representations

39
Virtual Community
  • Virtual community: a concentric-circle model
  • Discovering virtual communities and their
    evolution in a network environment

40
Discovering Virtual Communities
  • Traditional clustering methods
  • No order in a cluster
  • An object can only belong to one cluster
    (usually)
  • Need an accurate distance function
  • Our method
  • Using an extended association-rule approach to discover authoritative clusters
  • The core of a community is well represented by an
    authoritative cluster
  • Expanding the core gradually

41
The Algorithm
  • Step 1: Finding candidate objects for cores
  • Given an object set G and its link topology, build the adjacency matrix A_G
  • Calculate hub and authority values for each object
  • G′ is the most authoritative subset of G
  • Step 2: Mining frequent itemsets
  • m-itemset: a combination of m objects
  • As in association-rule mining, build the 1- to m-itemsets that are frequent in G′ (m = 5 is sufficient and efficient)

42
The Algorithm (cont.)
  • Step 3: Constructing cores
  • Merging similar itemsets (1st-pass merging)
  • A super-itemset must be semi-frequent
  • Step 4: Building complete clusters
  • Expanding cores with objects in G − G′, according to in-links
  • Merging similar clusters (2nd-pass merging)
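Step 1 of the algorithm relies on hub and authority values; a minimal sketch of the standard HITS iteration on an adjacency list (the toy graph and iteration count are illustrative assumptions):

```python
def hits(adj, nodes, iterations=50):
    """Standard HITS: authority(v) sums hub(u) over links u -> v,
    hub(u) sums authority(v) over links u -> v; L2-normalize each round."""
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        auth = {v: sum(hub[u] for u in nodes if v in adj.get(u, ())) for v in nodes}
        norm = sum(x * x for x in auth.values()) ** 0.5 or 1.0
        auth = {v: x / norm for v, x in auth.items()}
        hub = {u: sum(auth[v] for v in adj.get(u, ())) for u in nodes}
        norm = sum(x * x for x in hub.values()) ** 0.5 or 1.0
        hub = {u: x / norm for u, x in hub.items()}
    return hub, auth

# Toy citation graph: a and b both point to c; c points to d.
adj = {"a": ["c"], "b": ["c"], "c": ["d"]}
hub, auth = hits(adj, ["a", "b", "c", "d"])
print(max(auth, key=auth.get))
```

The most authoritative objects (here, the node collecting links from both hubs) would seed the candidate set G′.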

43
Evolution of Communities
  • If the link topology evolves, add a time axis to the clustering process
  • Time sequence analysis of communities

44
A Case Study Mining CS Interest Groups
  • Data
  • Papers
  • Proceedings papers from the ACM Digital Library
  • 60,000 papers from 165 proceedings
  • People
  • CS researchers from 264 universities and labs
  • 200,000 pages crawled, 22,000 classified as personal homepages
  • Automatically discover interest groups and their
    evolution in the computer science area
  • Emergence and evolution of a research topic
  • Career of a researcher
  • Value of a paper

45
(No Transcript)
46
Summary
  • Features
  • Discovering communities from network
    automatically
  • Each community is organized in a gradual
    structure
  • Every object can be included in multiple
    communities
  • Depending on topology only
  • Reflecting the evolution of communities
  • Applications
  • Our secret weapon to beat CiteSeer
  • Consumer study in e-commerce
  • A fundamental technique for network analysis

47
Outline
  • Log Based Query Clustering
  • Log Based Query Expansion
  • Mining Community from Network
  • Enhancing Web IR Using Page Segmentation

48
Motivation
  • Low quality of Web pages
  • Noisy: decoration, interaction, contact info
  • Multiple topics (e.g., http://news.yahoo.com/)
  • Web page segmentation → filtering irrelevant information
  • Traditional passage retrieval does not consider page structure
  • The DOM tree is designed for presentation, not for content organization
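Of the segmentation schemes compared later in the talk, fixed-length windows (the WINPS-style baseline) are the simplest to sketch; window size and overlap here are illustrative assumptions:

```python
def window_passages(text, size=50, overlap=25):
    """Split a page's text into fixed-length, overlapping word windows,
    so retrieval can score passages instead of whole noisy pages."""
    words = text.split()
    step = size - overlap
    passages = []
    for start in range(0, max(len(words) - overlap, 1), step):
        passages.append(" ".join(words[start:start + size]))
    return passages

# A 100-word page yields three 50-word windows with 25-word overlap.
page = " ".join(f"w{i}" for i in range(100))
print(len(window_passages(page, size=50, overlap=25)))
```

Vision-based segmentation (VIPS) replaces these arbitrary cut points with boundaries derived from the page's visual layout.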

49
Vision-based Web Page Analysis
  • The DOM tree is browser-oriented and doesn't directly reflect the semantic structure.
  • The 2-D layout of a Web page reflects the organization pattern in the designer's mind.
  • A new algorithm re-engineers the page structure based on visual clues.

50
(No Transcript)
51
Effect on Improving Web IR
  • Comparison of several segmentation-based query
    expansion methods

Expansion terms   Baseline   FULLDOC           DOMPS   VIPS              WINPS   COMBPS
3                 16.55      17.56 (+6.10%)    16.75   18.15 (+9.67%)    17.03   17.45
5                 16.55      17.46 (+5.50%)    16.76   19.62 (+18.55%)   18.62   17.30
10                16.55      19.10 (+15.41%)   16.67   19.81 (+19.70%)   18.53   18.99
20                16.55      17.89 (+8.10%)    17.71   20.66 (+24.83%)   20.38   20.64
30                16.55      17.40 (+5.14%)    18.20   20.47 (+23.69%)   19.85   21.35
40                16.55      15.50 (-6.34%)    17.73   18.65 (+12.69%)   19.94   20.99
50                16.55      13.82 (-16.50%)   18.46   17.41 (+5.20%)    19.23   19.97
52
Comparison Chart
By combining the vision-based and fixed-length window methods, we obtain the best retrieval performance (> 23% improvement) on the TREC-9 and TREC-10 Web data.
53
Precisions at Different Recall Levels
Recall           Baseline   FULLDOC (10)      DOMPS (80)        VIPS (20)         WINPS (20)        COMBPS (30)
Rel_ret (3363)   2049       2246              2333              2471              2391              2416
0                58.55      62.04 (+5.96%)    57.94             58.69             60.53             60.48
10               37.09      40.59 (+9.44%)    39.96             39.99             42.95             43.09
20               28.13      30.43 (+8.18%)    31.61             31.96             33.45             35.51
30               21.35      24.84 (+16.35%)   27.30             27.54             29.00             29.16
40               16.94      21.19 (+25.09%)   22.93             23.62             23.32             24.87
50               14.33      17.60 (+22.82%)   18.69             20.52             19.36             20.78
60               10.61      13.33 (+25.64%)   12.98             16.07             14.40             15.82
70               7.66       9.87 (+28.85%)    9.39              11.53             10.82             12.62
80               5.96       6.65 (+11.58%)    6.65              8.26              7.32              7.76
90               3.97       3.85 (-3.02%)     3.59              4.65              4.35              4.67
100              1.95       0.96 (-50.77%)    1.06              1.86              1.79              1.98
Avg.             16.55      19.10 (+15.41%)   19.22 (+16.13%)   20.66 (+24.83%)   20.38 (+23.14%)   21.35 (+29.00%)
54
Mining for Enhanced Web Search
  • Quoted from the WISE'02 "Mining for Enhanced Web Search" workshop:
  • "We are currently drowning in data oceans and facing serious data overload... We foresee that the biggest challenge in the next several decades is how to effectively and efficiently dig out a machine-understandable information and knowledge layer from unorganized Web data... Since the Web is huge, heterogeneous and dynamic, automated Web information and knowledge discovery calls for novel technologies and tools... The mined information and knowledge will greatly improve the effectiveness of current web searching and enable much more sophisticated web information retrieval technologies in the future."

55
  • Thanks!